[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19841#discussion_r154490891
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -104,147 +105,153 @@ case class InsertIntoHiveTable(
     val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns")
     val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty)
 
-    // By this time, the partition map must match the table's partition columns
-    if (partitionColumnNames.toSet != partition.keySet) {
-      throw new SparkException(
-        s"""Requested partitioning does not match the ${table.identifier.table} table:
-           |Requested partitions: ${partition.keys.mkString(",")}
-           |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin)
-    }
-
-    // Validate partition spec if there exist any dynamic partitions
-    if (numDynamicPartitions > 0) {
-      // Report error if dynamic partitioning is not enabled
-      if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
-        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
+    def processInsert = {
+      // By this time, the partition map must match the table's partition columns
+      if (partitionColumnNames.toSet != partition.keySet) {
+        throw new SparkException(
+          s"""Requested partitioning does not match the ${table.identifier.table} table:
+             |Requested partitions: ${partition.keys.mkString(",")}
+             |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin)
       }
 
-      // Report error if dynamic partition strict mode is on but no static partition is found
-      if (numStaticPartitions == 0 &&
-        hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) {
-        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg)
-      }
+      // Validate partition spec if there exist any dynamic partitions
+      if (numDynamicPartitions > 0) {
+        // Report error if dynamic partitioning is not enabled
+        if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
+          throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
+        }
+
+        // Report error if dynamic partition strict mode is on but no static partition is found
+        if (numStaticPartitions == 0 &&
+          hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) {
+          throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg)
+        }
 
-      // Report error if any static partition appears after a dynamic partition
-      val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty)
-      if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) {
-        throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg)
+        // Report error if any static partition appears after a dynamic partition
+        val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty)
+        if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) {
+          throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg)
+        }
       }
-    }
 
-    table.bucketSpec match {
-      case Some(bucketSpec) =>
-        // Writes to bucketed hive tables are allowed only if user does not care about maintaining
-        // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
-        // set to false
-        val enforceBucketingConfig = "hive.enforce.bucketing"
-        val enforceSortingConfig = "hive.enforce.sorting"
+      table.bucketSpec match {
+        case Some(bucketSpec) =>
+          // Writes to bucketed hive tables are allowed only if user does not care about maintaining
+          // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
+          // set to false
+          val enforceBucketingConfig = "hive.enforce.bucketing"
+          val enforceSortingConfig = "hive.enforce.sorting"
 
-        val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
-          "currently does NOT populate bucketed output which is compatible with Hive."
+          val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
+            "currently does NOT populate bucketed output which is compatible with Hive."
 
-        if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean ||

[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19841#discussion_r154490878
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -104,147 +105,153 @@ case class InsertIntoHiveTable(
     val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns")
     val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty)
 
-    // By this time, the partition map must match the table's partition columns
-    if (partitionColumnNames.toSet != partition.keySet) {
-      throw new SparkException(
-        s"""Requested partitioning does not match the ${table.identifier.table} table:
-           |Requested partitions: ${partition.keys.mkString(",")}
-           |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin)
-    }
-
-    // Validate partition spec if there exist any dynamic partitions
-    if (numDynamicPartitions > 0) {
-      // Report error if dynamic partitioning is not enabled
-      if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
-        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
+    def processInsert = {
--- End diff --

Generally, we do not like nested methods.
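
One way to honor that while still fixing SPARK-22642 (cleaning up `createdTempDir` even when the insert fails) is an inline try/finally instead of a nested def. A minimal sketch, with `deleteTempPath` as a purely hypothetical helper:

```scala
try {
  // validate partitions, write the data, and load it into the Hive table
} finally {
  // runs on success and failure alike, so the staging dir is never leaked
  createdTempDir.foreach(dir => deleteTempPath(dir, hadoopConf))  // hypothetical helper
}
```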


---




[GitHub] spark pull request #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast beh...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19858#discussion_r154490770
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1776,6 +1776,8 @@ options.
 Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type.
   - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.
   - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
+ 
+ - Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details.
--- End diff --

Sorry, there is a duplicate `for details`
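
For reference, the hint the section documents can also be expressed in the DataFrame API (a minimal sketch; `large` and `small` are illustrative DataFrames):

```scala
import org.apache.spark.sql.functions.broadcast

// Explicitly mark `small` as the broadcast side; since Spark 2.3 the hinted
// table is preferred whenever a broadcast join strategy is applicable.
val joined = large.join(broadcast(small), "key")
```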


---




[GitHub] spark pull request #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/cor...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19863#discussion_r154490741
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/OrcTest.scala ---
@@ -15,18 +15,17 @@
  * limitations under the License.
  */
 
-package org.apache.spark.sql.hive.orc
+package org.apache.spark.sql.execution.datasources
 
 import java.io.File
 
 import scala.reflect.ClassTag
 import scala.reflect.runtime.universe.TypeTag
 
 import org.apache.spark.sql._
-import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.test.SQLTestUtils
 
-private[sql] trait OrcTest extends SQLTestUtils with TestHiveSingleton {
+abstract class OrcTest extends QueryTest with SQLTestUtils {
--- End diff --

Since we have not yet moved the ORC test suites, why do we need to move this out of Hive now?


---




[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r154490687
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +537,48 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
--- End diff --

Could you check the test case of `def checkpoint`? At least we need to add 
a test case.
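
For example, something along these lines would cover the new overloads (a sketch only, assuming the usual `DatasetSuite` style and that a checkpoint directory is set in the suite's setup):

```scala
test("checkpoint and localCheckpoint preserve data") {
  val ds = spark.range(10)
  // eager = true materializes the checkpoint immediately;
  // checkpoint() requires sc.setCheckpointDir(...) beforehand.
  assert(ds.checkpoint(eager = true).count() === 10)
  assert(ds.localCheckpoint(eager = true).count() === 10)
}
```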


---




[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19865
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19865
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84386/
Test FAILed.


---




[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19865
  
**[Test build #84386 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84386/testReport)**
 for PR 19865 at commit 
[`3b25b5f`](https://github.com/apache/spark/commit/3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19865: [SPARK-22668][SQL] Exclude global variables from argumen...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19865
  
**[Test build #84386 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84386/testReport)**
 for PR 19865 at commit 
[`3b25b5f`](https://github.com/apache/spark/commit/3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2).


---




[GitHub] spark pull request #19865: [SPARK-22668][SQL] Exclude global variables from ...

2017-12-01 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/19865

[SPARK-22668][SQL] Exclude global variables from arguments of method split 
by CodegenContext.splitExpressions()

## What changes were proposed in this pull request?

This PR fixes an incorrect result produced when a method split out by 
`CodegenContext.splitExpressions()` takes global variables as arguments. 
Because such a variable is redeclared as a parameter (a local variable) of the 
split method, an assignment that was meant to update the global variable only 
updates the local copy.

To fix this problem, this PR excludes global variables from arguments of 
the split method.

Before this PR
```
class Test {
  int globalInt;

  void splittedFunction(int globalInt) {
...
globalInt = 2;
  }

  void apply() {
globalInt = 1;
...
splittedFunction(globalInt);
// globalInt should be 2 here, since it is 2 when the statements are not split
  }
}
```

After this PR.
```
class Test {
  int globalInt;

  void splittedFunction() {
...
globalInt = 2;
  }

  void apply() {
globalInt = 1;
...
splittedFunction();
// globalInt is 2
  }
}
```
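
A minimal sketch of the filtering idea in Scala (the helper name and shapes are invented for illustration; the real change lives inside `CodegenContext.splitExpressions()`):

```scala
// Arguments are (javaType, variableName) pairs; `globalVars` holds the names
// the codegen context has declared as class-level (global) variables.
def excludeGlobalsFromArguments(
    arguments: Seq[(String, String)],
    globalVars: Set[String]): Seq[(String, String)] = {
  // A global variable is already visible inside the split method, so passing
  // it as a parameter would only shadow it with a local copy.
  arguments.filterNot { case (_, name) => globalVars.contains(name) }
}
```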

## How was this patch tested?

Added test case into `CodeGenerationSuite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-22668

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19865


commit 3b25b5f2fdaec7b1e97732f1b8a0bf4c588a44f2
Author: Kazuaki Ishizaki 
Date:   2017-12-02T06:07:03Z

initial commit




---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19783
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84385/
Test PASSed.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19783
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19783
  
**[Test build #84385 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)**
 for PR 19783 at commit 
[`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19864
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19864
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84384/
Test PASSed.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19864
  
**[Test build #84384 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84384/testReport)**
 for PR 19864 at commit 
[`2082c0e`](https://github.com/apache/spark/commit/2082c0ed776206db4f064b936d9f560a4633b6a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #10803: [SPARK-12875] [ML] Add Weight of Evidence and Informatio...

2017-12-01 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/10803
  
No it's not merged. Feel free to use the code as you wish.


---




[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19845


---




[GitHub] spark issue #19845: [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hi...

2017-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19845
  
Merged to master.

Thanks for reviewing this @viirya, @jiangxb1987, @dongjoon-hyun and 
@imatiach-msft.


---




[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19783
  
**[Test build #84385 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84385/testReport)**
 for PR 19783 at commit 
[`9d2a463`](https://github.com/apache/spark/commit/9d2a463f76ecd164dd15a834aa971a4d222113b6).


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19864
  
**[Test build #84384 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84384/testReport)**
 for PR 19864 at commit 
[`2082c0e`](https://github.com/apache/spark/commit/2082c0ed776206db4f064b936d9f560a4633b6a0).


---




[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19863
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84381/
Test PASSed.


---




[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19863
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19863
  
**[Test build #84381 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84381/testReport)**
 for PR 19863 at commit 
[`a58ceba`](https://github.com/apache/spark/commit/a58ceba87f4e5a2b70d934e45a20030d6ddca3c2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `abstract class OrcTest extends QueryTest with SQLTestUtils `
  * `class OrcFilterSuite extends OrcTest with TestHiveSingleton `
  * `class OrcQuerySuite extends OrcTest with TestHiveSingleton with 
BeforeAndAfterAll `


---




[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19858
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19858
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84383/
Test PASSed.


---




[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19858
  
**[Test build #84383 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84383/testReport)**
 for PR 19858 at commit 
[`76148f5`](https://github.com/apache/spark/commit/76148f582404a9f5272ae55d9994133020832536).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...

2017-12-01 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19841#discussion_r154480032
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala ---
@@ -104,147 +105,153 @@ case class InsertIntoHiveTable(
     val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns")
     val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty)
 
-    // By this time, the partition map must match the table's partition columns
-    if (partitionColumnNames.toSet != partition.keySet) {
-      throw new SparkException(
-        s"""Requested partitioning does not match the ${table.identifier.table} table:
-           |Requested partitions: ${partition.keys.mkString(",")}
-           |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin)
-    }
-
-    // Validate partition spec if there exist any dynamic partitions
-    if (numDynamicPartitions > 0) {
-      // Report error if dynamic partitioning is not enabled
-      if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
-        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
+    def processInsert = {
+      // By this time, the partition map must match the table's partition columns
+      if (partitionColumnNames.toSet != partition.keySet) {
+        throw new SparkException(
+          s"""Requested partitioning does not match the ${table.identifier.table} table:
+             |Requested partitions: ${partition.keys.mkString(",")}
+             |Table partitions: ${table.partitionColumnNames.mkString(",")}""".stripMargin)
      }
 
-      // Report error if dynamic partition strict mode is on but no static partition is found
-      if (numStaticPartitions == 0 &&
-        hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) {
-        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg)
-      }
+      // Validate partition spec if there exist any dynamic partitions
+      if (numDynamicPartitions > 0) {
+        // Report error if dynamic partitioning is not enabled
+        if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
+          throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
+        }
+
+        // Report error if dynamic partition strict mode is on but no static partition is found
+        if (numStaticPartitions == 0 &&
+          hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) {
+          throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg)
+        }
 
-      // Report error if any static partition appears after a dynamic partition
-      val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty)
-      if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) {
-        throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg)
+        // Report error if any static partition appears after a dynamic partition
+        val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty)
+        if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) {
+          throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg)
+        }
      }
-    }
 
-    table.bucketSpec match {
-      case Some(bucketSpec) =>
-        // Writes to bucketed hive tables are allowed only if user does not care about maintaining
-        // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
-        // set to false
-        val enforceBucketingConfig = "hive.enforce.bucketing"
-        val enforceSortingConfig = "hive.enforce.sorting"
+      table.bucketSpec match {
+        case Some(bucketSpec) =>
+          // Writes to bucketed hive tables are allowed only if user does not care about maintaining
+          // table's bucketing ie. both "hive.enforce.bucketing" and "hive.enforce.sorting" are
+          // set to false
+          val enforceBucketingConfig = "hive.enforce.bucketing"
+          val enforceSortingConfig = "hive.enforce.sorting"
 
-        val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
-          "currently does NOT populate bucketed output which is compatible with Hive."
+          val message = s"Output Hive table ${table.identifier} is bucketed but Spark" +
+            "currently does NOT populate bucketed output which is compatible with Hive."
 
-        if (hadoopConf.get(enforceBucketingConfig, "true").toBoolean ||

[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19858
  
**[Test build #84383 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84383/testReport)**
 for PR 19858 at commit 
[`76148f5`](https://github.com/apache/spark/commit/76148f582404a9f5272ae55d9994133020832536).


---




[GitHub] spark issue #19627: [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSp...

2017-12-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19627
  
@jkbradley I think it is better to review #19857 (fix python model specific 
optimization) and merge it first, and then I will rebase and update this PR. :)


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19848
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84380/
Test PASSed.


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19848
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19848
  
**[Test build #84380 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84380/testReport)**
 for PR 19848 at commit 
[`8e68683`](https://github.com/apache/spark/commit/8e6868395e9291ad8ddf2a0251f7e6163618b132).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19864
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19864
  
**[Test build #84382 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84382/testReport)**
 for PR 19864 at commit 
[`32f7c74`](https://github.com/apache/spark/commit/32f7c74a9b5cf4f19e7d14357bb87064383e11b3).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19864
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84382/
Test FAILed.


---




[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...

2017-12-01 Thread a10y
Github user a10y commented on the issue:

https://github.com/apache/spark/pull/18424
  
@ptkool are you still tracking this at all?


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19864
  
**[Test build #84382 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84382/testReport)**
 for PR 19864 at commit 
[`32f7c74`](https://github.com/apache/spark/commit/32f7c74a9b5cf4f19e7d14357bb87064383e11b3).


---




[GitHub] spark issue #19864: [SPARK-22673][SQL] InMemoryRelation should utilize on-di...

2017-12-01 Thread CodingCat
Github user CodingCat commented on the issue:

https://github.com/apache/spark/pull/19864
  
@cloud-fan @viirya @gatorsmile @felixcheung @hvanhovell @HyukjinKwon 
@dongjoon-hyun @liancheng 


---




[GitHub] spark pull request #19864: [SPARK-22673][SQL] InMemoryRelation should utiliz...

2017-12-01 Thread CodingCat
GitHub user CodingCat opened a pull request:

https://github.com/apache/spark/pull/19864

[SPARK-22673][SQL] InMemoryRelation should utilize on-disk table stats 
whenever possible

## What changes were proposed in this pull request?

The current implementation of InMemoryRelation always uses the most 
expensive execution plan when writing the cache. With CBO enabled, we can 
actually have a more exact estimation of the underlying table size...
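
A sketch of the intuition (assumed shape, not the actual patch): instead of a worst-case default, reuse the optimizer's estimate, which reflects on-disk/CBO statistics when they exist:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// `stats.sizeInBytes` falls back to conservative defaults, but with CBO and
// ANALYZE TABLE it carries a much more realistic size for the underlying table.
def estimatedCacheSizeInBytes(plan: LogicalPlan): BigInt = plan.stats.sizeInBytes
```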

## How was this patch tested?

existing test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CodingCat/spark SPARK-22673

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19864


commit b2fb1d25804b7bdbe1a767306a319dc748965bce
Author: CodingCat 
Date:   2016-03-07T14:37:37Z

improve the doc for "spark.memory.offHeap.size"

commit 0971900d562cb1a18af6f7de02bb8eb95637a640
Author: CodingCat 
Date:   2016-03-07T19:00:16Z

fix

commit 32f7c74a9b5cf4f19e7d14357bb87064383e11b3
Author: CodingCat 
Date:   2017-12-01T23:05:35Z

use cbo stats in inmemoryrelation




---




[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-12-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19746
  
@WeichenXu123  From what I've seen, it's more common for people to use 
VectorAssembler to assemble a bunch of Numeric columns, rather than a bunch of 
Vector columns.  I'd recommend we do things incrementally, adding single-column 
support before multi-column support (especially since we're still trying to 
achieve consensus about design for multi-column support, per my recent comment 
in the umbrella JIRA).


---




[GitHub] spark issue #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19863
  
**[Test build #84381 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84381/testReport)**
 for PR 19863 at commit 
[`a58ceba`](https://github.com/apache/spark/commit/a58ceba87f4e5a2b70d934e45a20030d6ddca3c2).


---




[GitHub] spark pull request #19863: [SPARK-22672][TEST][SQL] Move OrcTest to `sql/cor...

2017-12-01 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19863

[SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`

## What changes were proposed in this pull request?

To support ORC tests without Hive, we had better have `OrcTest` in 
`sql/core` instead of `sql/hive`.

## How was this patch tested?

This is a test suite only change. It should pass all existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22672

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19863.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19863


commit a58ceba87f4e5a2b70d934e45a20030d6ddca3c2
Author: Dongjoon Hyun 
Date:   2017-12-01T21:58:54Z

[SPARK-22672][TEST][SQL] Move OrcTest to `sql/core`




---




[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...

2017-12-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19527
  
Question about this PR description comment:
> Note that keep can't be used at the same time with dropLast as true. 
Because they will conflict in encoded vector by producing a vector of zeros.

Why is this necessary?  With ```n``` categories found in fitting, shouldn't 
the behavior be the following?
* ```keep=true, dropLast=true``` ==> vector size n
* ```keep=true, dropLast=false``` ==> vector size n+1
* ```keep=false, dropLast=true``` ==> vector size n-1
* ```keep=false, dropLast=false``` ==> vector size n
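
Concretely, with n = 3 categories seen during fitting (and assuming `keep` appends one extra slot for invalid/unseen values while `dropLast` removes the final slot):
* keep=true,  dropLast=true  ==> size 3  (3 + 1 - 1)
* keep=true,  dropLast=false ==> size 4  (3 + 1)
* keep=false, dropLast=true  ==> size 2  (3 - 1)
* keep=false, dropLast=false ==> size 3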



---




[GitHub] spark pull request #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast beh...

2017-12-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19858#discussion_r154453799
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1776,6 +1776,8 @@ options.
 Note that, for DecimalType(38,0)*, the table above intentionally does not cover all other combinations of scales and precisions because currently we only infer decimal type like `BigInteger`/`BigInt`. For example, 1.1 is inferred as double type.
   - In PySpark, now we need Pandas 0.19.2 or upper if you want to use Pandas related functionalities, such as `toPandas`, `createDataFrame` from Pandas DataFrame, etc.
   - In PySpark, the behavior of timestamp values for Pandas related functionalities was changed to respect session timezone. If you want to use the old behavior, you need to set a configuration `spark.sql.execution.pandas.respectSessionTimeZone` to `False`. See [SPARK-22395](https://issues.apache.org/jira/browse/SPARK-22395) for details.
+ 
+ - Since Spark 2.3, broadcast behaviour changed to broadcast the join side with an explicit broadcast hint first. See [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details.
--- End diff --

```
Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer to broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [JDBC/ODBC](#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489) for details.
```


---




[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...

2017-12-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/19527#discussion_r154452715
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -41,8 +41,12 @@ import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
   * The output vectors are sparse.
   *
   * @see `StringIndexer` for converting categorical values into category indices
+ * @deprecated `OneHotEncoderEstimator` will be renamed `OneHotEncoder` and this `OneHotEncoder`
--- End diff --

Note for the future: For 3.0, it'd be nice to do what you're describing 
here but also leave OneHotEncoderEstimator as a deprecated alias.  That way, 
user code won't break but will have deprecation warnings when upgrading to 3.0.
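
A sketch of what such an alias could look like (purely hypothetical; no such alias exists in the code):

```scala
// e.g. in a package object for org.apache.spark.ml.feature
@deprecated("Use OneHotEncoder instead.", "3.0.0")
type OneHotEncoderEstimator = OneHotEncoder
```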


---




[GitHub] spark pull request #19838: [SPARK-22638][SS]Use a separate queue for Streami...

2017-12-01 Thread zsxwing
Github user zsxwing closed the pull request at:

https://github.com/apache/spark/pull/19838


---




[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate queue for StreamingQuery...

2017-12-01 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/19838
  
Thanks! Merging to master.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84378/
Test PASSed.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19651
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #84378 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84378/testReport)**
 for PR 19651 at commit 
[`74cb053`](https://github.com/apache/spark/commit/74cb053ffaaa5c850810cfbd71e76aad38e2ba5b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class OrcDeserializer(`
  * `  sealed trait CatalystDataUpdater `
  * `  final class RowUpdater(row: InternalRow) extends CatalystDataUpdater 
`
  * `  final class ArrayDataUpdater(array: ArrayData) extends 
CatalystDataUpdater `
  * `class OrcSerializer(dataSchema: StructType) `


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19848
  
**[Test build #84380 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84380/testReport)**
 for PR 19848 at commit 
[`8e68683`](https://github.com/apache/spark/commit/8e6868395e9291ad8ddf2a0251f7e6163618b132).


---




[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2017-12-01 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/19045
  
So it seems like the YARN changes are only going to happen in Hadoop 3+, so 
this might make sense regardless of what happens in 
https://github.com/apache/spark/pull/19267 (since folks like K8s or whoever can 
send the message as desired).


---




[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate queue for StreamingQuery...

2017-12-01 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/19838
  
LGTM.


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-01 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19848
  
LGTM but I'll leave it here a bit for others to take a look.


---




[GitHub] spark issue #19858: [SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior c...

2017-12-01 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19858
  
cc @gatorsmile 


---




[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...

2017-12-01 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19840
  
Instead of setting `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python`, what 
happens if you set `PYSPARK_DRIVER_PYTHON=~/anaconda3/envs/py3/bin/python`?


---




[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

2017-12-01 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/19831
  
Instead of manually setting up table statistics, I'm just trying to simulate 
the statistics for these tables in this way. 
If `totalSize` (or `rawDataSize`) is `> 0` but `rowCount = 0`, then at least one 
parameter is incorrect, and we should not optimize based on these incorrect 
statistics.
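
In other words, a sanity check of roughly this shape (a sketch, not the exact patch):

```scala
// Hive table parameters that claim bytes on disk but zero rows contradict
// each other, so CBO should not plan against them.
def statsAreConsistent(totalSize: Long, rawDataSize: Long, rowCount: Long): Boolean =
  !((totalSize > 0 || rawDataSize > 0) && rowCount == 0)
```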


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-12-01 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@cloud-fan done, could you please have a look?


---




[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19758
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84379/
Test PASSed.


---




[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19758
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19758
  
**[Test build #84379 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84379/testReport)**
 for PR 19758 at commit 
[`5bcccda`](https://github.com/apache/spark/commit/5bcccda6a599f60d2d084ecf871649675a792d5d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19627: [SPARK-21088][ML][WIP] CrossValidator, TrainValidationSp...

2017-12-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19627
  
Is this still WIP or ready?


---




[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-12-01 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/19631
  
+1


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-12-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r154408356
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory}
+import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types._
+
+/**
+ * Helper object for building ORC `SearchArgument`s, which are used for ORC predicate push-down.
+ *
+ * Due to limitation of ORC `SearchArgument` builder, we had to end up with a pretty weird double-
+ * checking pattern when converting `And`/`Or`/`Not` filters.
+ *
+ * An ORC `SearchArgument` must be built in one pass using a single builder.  For example, you can't
+ * build `a = 1` and `b = 2` first, and then combine them into `a = 1 AND b = 2`.  This is quite
+ * different from the cases in Spark SQL or Parquet, where complex filters can be easily built using
+ * existing simpler ones.
+ *
+ * The annoying part is that, `SearchArgument` builder methods like `startAnd()`, `startOr()`, and
+ * `startNot()` mutate internal state of the builder instance.  This forces us to translate all
+ * convertible filters with a single builder instance. However, before actually converting a filter,
+ * we've no idea whether it can be recognized by ORC or not. Thus, when an inconvertible filter is
+ * found, we may already end up with a builder whose internal state is inconsistent.
+ *
+ * For example, to convert an `And` filter with builder `b`, we call `b.startAnd()` first, and then
+ * try to convert its children.  Say we convert `left` child successfully, but find that `right`
+ * child is inconvertible.  Alas, `b.startAnd()` call can't be rolled back, and `b` is inconsistent
+ * now.
+ *
+ * The workaround employed here is that, for `And`/`Or`/`Not`, we first try to convert their
+ * children with brand new builders, and only do the actual conversion with the right builder
+ * instance when the children are proven to be convertible.
+ *
+ * P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only.  Usage of
+ * builder methods mentioned above can only be found in test code, where all tested filters are
+ * known to be convertible.
+ */
+private[orc] object OrcFilters {
--- End diff --

Yes. It's logically the same as the old version; only the API usage is updated here.
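
For a reader following along, a condensed illustration of that double-checking pattern for the `And` case (simplified; `buildSearchArgument` stands in for the file's real recursive conversion helper):

```scala
def convertAnd(
    left: Filter,
    right: Filter,
    builder: Builder,
    buildSearchArgument: (Filter, Builder) => Option[Builder]): Option[Builder] =
  for {
    // Prove both children convertible on throwaway builders first, so a
    // failure cannot leave `builder` in an inconsistent state.
    _ <- buildSearchArgument(left, SearchArgumentFactory.newBuilder())
    _ <- buildSearchArgument(right, SearchArgumentFactory.newBuilder())
    // Only then mutate the real builder.
    lhs <- buildSearchArgument(left, builder.startAnd())
    rhs <- buildSearchArgument(right, lhs)
  } yield rhs.end()
```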


---




[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...

2017-12-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19651#discussion_r154407217
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala ---
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument, SearchArgumentFactory}
+import org.apache.orc.storage.ql.io.sarg.SearchArgument.Builder
+import org.apache.orc.storage.serde2.io.HiveDecimalWritable
+
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types._
+
+/**
+ * Helper object for building ORC `SearchArgument`s, which are used for ORC predicate push-down.
+ *
+ * Due to limitation of ORC `SearchArgument` builder, we had to end up with a pretty weird double-
+ * checking pattern when converting `And`/`Or`/`Not` filters.
+ *
+ * An ORC `SearchArgument` must be built in one pass using a single builder.  For example, you can't
+ * build `a = 1` and `b = 2` first, and then combine them into `a = 1 AND b = 2`.  This is quite
+ * different from the cases in Spark SQL or Parquet, where complex filters can be easily built using
+ * existing simpler ones.
+ *
+ * The annoying part is that, `SearchArgument` builder methods like `startAnd()`, `startOr()`, and
+ * `startNot()` mutate internal state of the builder instance.  This forces us to translate all
+ * convertible filters with a single builder instance. However, before actually converting a filter,
+ * we've no idea whether it can be recognized by ORC or not. Thus, when an inconvertible filter is
+ * found, we may already end up with a builder whose internal state is inconsistent.
+ *
+ * For example, to convert an `And` filter with builder `b`, we call `b.startAnd()` first, and then
+ * try to convert its children.  Say we convert `left` child successfully, but find that `right`
+ * child is inconvertible.  Alas, `b.startAnd()` call can't be rolled back, and `b` is inconsistent
+ * now.
+ *
+ * The workaround employed here is that, for `And`/`Or`/`Not`, we first try to convert their
+ * children with brand new builders, and only do the actual conversion with the right builder
+ * instance when the children are proven to be convertible.
+ *
+ * P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only.  Usage of
+ * builder methods mentioned above can only be found in test code, where all tested filters are
+ * known to be convertible.
+ */
+private[orc] object OrcFilters {
--- End diff --

I didn't review it carefully; I just assumed it's the same as the old version, 
with the API updated.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Refactor R...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19758
  
**[Test build #84379 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84379/testReport)**
 for PR 19758 at commit 
[`5bcccda`](https://github.com/apache/spark/commit/5bcccda6a599f60d2d084ecf871649675a792d5d).


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-12-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19651
  
Thank you so much, @cloud-fan .


---




[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19651
  
**[Test build #84378 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84378/testReport)** for PR 19651 at commit [`74cb053`](https://github.com/apache/spark/commit/74cb053ffaaa5c850810cfbd71e76aad38e2ba5b).


---




[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...

2017-12-01 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19848#discussion_r154403383
  
--- Diff: core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala ---
@@ -70,7 +70,8 @@ object SparkHadoopMapRedUtil extends Logging {
   if (shouldCoordinateWithDriver) {
 val outputCommitCoordinator = SparkEnv.get.outputCommitCoordinator
 val taskAttemptNumber = TaskContext.get().attemptNumber()
-val canCommit = outputCommitCoordinator.canCommit(jobId, splitId, taskAttemptNumber)
+val stageId = TaskContext.get().stageId()
+val canCommit = outputCommitCoordinator.canCommit(stageId, splitId, taskAttemptNumber)
--- End diff --

Ah, great, another public class that has no good reason to be public...


---




[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-12-01 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r154403001
  
--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,12 @@ object MimaExcludes {
 
   // Exclude rules for 2.3.x
   lazy val v23excludes = v22excludes ++ Seq(
+// SPARK-22372: Make cluster submission use SparkApplication.
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"),
--- End diff --

The change is 2.3 only.

I've always questioned why `SparkHadoopUtil` is public in the first place. 
I'm not that worried about removing the functionality in this method because 
Spark apps don't really need to depend on it; it's easy to call the Hadoop API 
directly, and the API did nothing before except in YARN mode.
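
(For illustration, "calling the Hadoop API directly" could look like the sketch below; this is an assumption about what callers would write, not code from this PR.)

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.UserGroupInformation

// Roughly what the removed helper did: look up a secret by alias in the
// current user's Hadoop credentials (assumes hadoop-common on the classpath).
def secretKeyFromUserCredentials(key: String): Array[Byte] =
  UserGroupInformation.getCurrentUser.getCredentials.getSecretKey(new Text(key))
```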


---




[GitHub] spark issue #19789: [SPARK-22562][Streaming] CachedKafkaConsumer unsafe evic...

2017-12-01 Thread daroo
Github user daroo commented on the issue:

https://github.com/apache/spark/pull/19789
  
@HyukjinKwon
@zsxwing 

Any chance of moving ahead with this?



---




[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19861
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19861
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84377/
Test PASSed.


---




[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin read less data whe...

2017-12-01 Thread gczsjdy
Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/19862
  
cc @cloud-fan @viirya @ConeyLiu 


---




[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19861
  
**[Test build #84377 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84377/testReport)** for PR 19861 at commit [`eaa6cae`](https://github.com/apache/spark/commit/eaa6cae1fd61f21238acdcfc7477d559cc47).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19862: Make SortMergeJoin read less data when wholeStageCodegen...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19862
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19862: Make SortMergeJoin read less data when wholeStage...

2017-12-01 Thread gczsjdy
GitHub user gczsjdy opened a pull request:

https://github.com/apache/spark/pull/19862

Make SortMergeJoin read less data when wholeStageCodegen is off

## What changes were proposed in this pull request?

In SortMergeJoin (with wholeStageCodegen), an optimization already exists: if the left side of a partition is empty, there is no need to read the corresponding partition of the right table. This benefits the case in which many partitions of the left table are empty and the right table is big.

In the code path without wholeStageCodegen, however, this optimization doesn't happen. This is mainly due to the lack of this optimization in the non-codegen SortMergeJoin path and the lack of lazy-evaluation support in `SortExec`. UI when wholeStageCodegen is off:
![image](https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG)

When the switch is on: 

![image](https://user-images.githubusercontent.com/7685352/33493675-c821b81a-d6fb-11e7-8cf8-2e5baff75be3.png)

This PR makes the sort lazily evaluated, and only triggers the right-table read after reading the left table and finding it non-empty.

After this PR, with wholeStageCodegen off:

![image](https://user-images.githubusercontent.com/7685352/33493784-2e1ee89a-d6fc-11e7-8201-71273de0b857.png)
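
The core idea is sketched below (a toy illustration with hypothetical names, not the actual `SortMergeJoinExec`/`SortExec` code): the right side is passed by-name, so its read/sort is only triggered once the left partition is known to be non-empty.

```scala
// Toy sketch: `rightSide` is by-name, so an empty left partition
// short-circuits without ever reading or sorting the right table.
def mergeJoinPartition[L, R](
    leftIter: Iterator[L],
    rightSide: => Iterator[R])(
    join: (Iterator[L], Iterator[R]) => Iterator[(L, R)]): Iterator[(L, R)] = {
  if (!leftIter.hasNext) {
    Iterator.empty // skip the right table entirely
  } else {
    join(leftIter, rightSide) // the right side is materialized only here
  }
}
```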

## How was this patch tested?

Existing test suites.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gczsjdy/spark smj_less_read

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19862


commit a8511e8b7e0240c471f88e703545301425960901
Author: GuoChenzhao 
Date:   2017-11-28T09:01:34Z

Init solution

commit a5c14b473636e571c6d4f17798f220be080a27a1
Author: GuoChenzhao 
Date:   2017-11-28T09:31:07Z

Style

commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5
Author: GuoChenzhao 
Date:   2017-11-28T09:37:54Z

Comments

commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6
Author: GuoChenzhao 
Date:   2017-12-01T11:32:06Z

lazy evaluation in SortExec




---




[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19860
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19860
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84376/
Test PASSed.


---




[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19860
  
**[Test build #84376 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84376/testReport)** for PR 19860 at commit [`ce74fb8`](https://github.com/apache/spark/commit/ce74fb8baff13332bed8c64db6c2a49be815f81f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19859
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

2017-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19859
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84375/
Test PASSed.


---




[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19859
  
**[Test build #84375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84375/testReport)** for PR 19859 at commit [`222d369`](https://github.com/apache/spark/commit/222d3690c09fdc4e99e6f2f45e786175d3153049).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19631: [SPARK-22372][core, yarn] Make cluster submission use Sp...

2017-12-01 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/19631
  
> If I understand what you're saying correctly, that should be considered a security issue in that server application regardless of this change. The server should not be exposing its environment to unprivileged users.

Yes, this is what I was saying, and I agree one could argue it's their fault. It's more a matter of us being nice: by not adding it in, we protect users who might not realize it's something they shouldn't expose. I'm not sure there was anything that sensitive in there before, but either way it's still good practice not to expose it.


> Let me think a little about this. I prefer the environment to the credentials approach because the latter are written to disk, but at the same time, that's less problematic than exposing the secret to users in the STS.

I agree that writing to disk isn't ideal either, but it's only the local disk and not HDFS, so a user would have to compromise the box, versus someone simply and accidentally exposing the environment/config via a UI or console.

Reviewing the changes now.


---




[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-12-01 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r154378257
  
--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,12 @@ object MimaExcludes {
 
   // Exclude rules for 2.3.x
   lazy val v23excludes = v22excludes ++ Seq(
+// SPARK-22372: Make cluster submission use SparkApplication.
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.deploy.SparkHadoopUtil.getSecretKeyFromUserCredentials"),
--- End diff --

Is this only going into Spark 2.3?


---




[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...

2017-12-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19831
  
Is it really an issue? If you manually set wrong statistics, what would you expect the system to do? I think data source tables don't allow you to set statistics manually, so this problem is inherited from Hive. cc @wzhfy to confirm.

This PR treats a row count of 0 as invalid, which is arguable: if we analyze an empty table, then a row count of 0 is valid.


---




[GitHub] spark pull request #19631: [SPARK-22372][core, yarn] Make cluster submission...

2017-12-01 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/19631#discussion_r154373635
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -542,7 +496,54 @@ private[spark] class SecurityManager(
* Gets the secret key.
* @return the secret key as a String if authentication is enabled, 
otherwise returns null
*/
-  def getSecretKey(): String = secretKey
+  def getSecretKey(): String = {
+if (isAuthenticationEnabled) {
+  Option(sparkConf.getenv(ENV_AUTH_SECRET))
--- End diff --

I agree `SPARK_AUTH_SECRET_CONF` is awkward and not really secure. When I initially did this, it's what was requested by other committers, since standalone and Mesos needed more security work around it anyway.

I don't fully follow how the MesosSecretConfig is going to be used. Are these just regular Spark configs passed around, or are they distributed through Mesos somehow?


---




[GitHub] spark issue #19855: [SPARK-22662] [SQL] Failed to prune columns after rewrit...

2017-12-01 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19855
  
cc @jiangxb1987 who has more context.


---




[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18906
  
For https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L42, I don't think it's a public API.

So, the Scala side is:

```scala
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputTypes: Option[Seq[DataType]]) {
```

```scala
def udf[RT: TypeTag](f: Function0[RT]): UserDefinedFunction = {
```

```scala
def register[RT: TypeTag](name: String, func: Function0[RT]): UserDefinedFunction = {
```

You are proposing adding `nullable` as below:

```python
class UserDefinedFunction(object):
    """
    User defined function in Python

    .. versionadded:: 1.3
    """
    def __init__(self, func,
                 returnType=StringType(), name=None,
                 evalType=PythonEvalType.SQL_BATCHED_UDF,
                 nullable=True):
```


```python
def udf(f=None, returnType=StringType(), nullable=True):
```

```python
def registerFunction(self, name, f, returnType=StringType(), nullable=True):
```

I don't think these are quite consistent?


---




[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19861
  
**[Test build #84377 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84377/testReport)** for PR 19861 at commit [`eaa6cae`](https://github.com/apache/spark/commit/eaa6cae1fd61f21238acdcfc7477d559cc47).


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-01 Thread jiangxb1987
GitHub user jiangxb1987 opened a pull request:

https://github.com/apache/spark/pull/19861

[SPARK-22387][SQL] Propagate session configs to data source read/write options

## What changes were proposed in this pull request?

Introduce a new interface `ConfigSupport` for `DataSourceV2`; it helps propagate session configs with chosen key prefixes to the particular data source.
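
A rough sketch of what such an interface might look like (hypothetical shape, for illustration only; the PR diff defines the actual interface):

```scala
// A DataSourceV2 implementation would mix this in to declare which
// session-config key prefixes should be copied into its read/write
// options (names here are illustrative).
trait ConfigSupport {
  /** Key prefixes of session configs to propagate, e.g. "spark.datasource.mysource.". */
  def getConfigPrefixes: java.util.List[String]
}
```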

## How was this patch tested?

Add corresponding test case in `DataSourceV2Suite`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jiangxb1987/spark datasource-configs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19861


commit 356c55eb5d50db1ab0fe9f15285cf31d993fad8a
Author: Xingbo Jiang 
Date:   2017-12-01T12:17:58Z

propagate session configs to data source options.

commit b076a69b0180825d1726019ef6213d1be2324c26
Author: Xingbo Jiang 
Date:   2017-12-01T13:36:22Z

fix scalastyle




---




[GitHub] spark issue #19850: [SPARK-22653][CORE] executorAddress registered in Coarse...

2017-12-01 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/19850
  
thanks @cloud-fan 


---




[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-12-01 Thread ptkool
Github user ptkool commented on the issue:

https://github.com/apache/spark/pull/18906
  
@HyukjinKwon As requested, here are the related Scala API changes:


https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L42


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala#L69


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3281


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L167


---




[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19860
  
**[Test build #84376 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84376/testReport)** for PR 19860 at commit [`ce74fb8`](https://github.com/apache/spark/commit/ce74fb8baff13332bed8c64db6c2a49be815f81f).


---




[GitHub] spark pull request #19860: [SPARK-22669][SQL] Avoid unnecessary function cal...

2017-12-01 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19860

[SPARK-22669][SQL] Avoid unnecessary function calls in code generation

## What changes were proposed in this pull request?

In many parts of the codebase for code generation, we split the code to avoid exceptions due to the 64KB method size limit. This generates a lot of methods which are called every time, even though sometimes this is not needed. As pointed out here: https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a non-negligible overhead which can be avoided.

The PR applies the same approach used in #19752 to the other places where this was feasible.
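
As a toy illustration of the trade-off (not Spark's actual `CodegenContext` API): splitting generated code into many small methods keeps each one under the JVM's 64KB bytecode-per-method limit, but every extra method adds a per-row call, so short bodies are better left inline.

```scala
// Illustrative only: inline small generated bodies, split big ones.
def emitEvalCode(pieces: Seq[String], inlineThreshold: Int = 1024): String = {
  if (pieces.map(_.length).sum < inlineThreshold) {
    // Small enough: emit inline, avoiding per-row function-call overhead.
    pieces.mkString("\n")
  } else {
    // Too big for a single method: wrap each piece in a helper and emit
    // the sequence of calls followed by the helper definitions.
    val defs = pieces.zipWithIndex.map { case (code, i) =>
      s"private void eval_$i(InternalRow row) { $code }"
    }
    val calls = pieces.indices.map(i => s"eval_$i(row);")
    (calls ++ defs).mkString("\n")
  }
}
```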

## How was this patch tested?

existing UTs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-22669

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19860.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19860


commit ce74fb8baff13332bed8c64db6c2a49be815f81f
Author: Marco Gaido 
Date:   2017-12-01T12:55:58Z

[SPARK-22669][SQL] Avoid unnecessary function calls in code generation




---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r154341528
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
 toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+if (Double.PositiveInfinity == reduction) {
--- End diff --

+1 for option 1. It sounds the best of the three to me too.


---




[GitHub] spark issue #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

2017-12-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19859
  
**[Test build #84375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84375/testReport)** for PR 19859 at commit [`222d369`](https://github.com/apache/spark/commit/222d3690c09fdc4e99e6f2f45e786175d3153049).


---




[GitHub] spark pull request #19859: [SPARK-22634][BUILD] Update Bouncy Castle to 1.58

2017-12-01 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/19859

[SPARK-22634][BUILD] Update Bouncy Castle to 1.58

## What changes were proposed in this pull request?

Update Bouncy Castle to 1.58, and jets3t to 0.9.4 to (sort of) match.

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-22634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19859.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19859


commit 222d3690c09fdc4e99e6f2f45e786175d3153049
Author: Sean Owen 
Date:   2017-12-01T12:11:25Z

Update Bouncy Castle to 1.58




---




[GitHub] spark pull request #19846: [SPARK-22393][SPARK-SHELL] spark-shell can't find...

2017-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19846


---



