[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843 **[Test build #84287 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84287/testReport)** for PR 19843 at commit [`08954fe`](https://github.com/apache/spark/commit/08954fe0e299a3dcf992d5a6cbd5f12d907ac453). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19839 I think we don't need this, since we need to upgrade to the next Janino release anyway for the issue related to SPARK-18016.
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843 **[Test build #84286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84286/testReport)** for PR 19843 at commit [`072f4b9`](https://github.com/apache/spark/commit/072f4b9f330af61e7de07c8c2e57421448aa306b).
[GitHub] spark pull request #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite su...
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/19843 [SPARK-22644][ML][TEST][WIP] Make ML testsuite support StructuredStreaming test ## What changes were proposed in this pull request? We need to add some helper code to make testing ML transformers & models easier with streaming data. These tests might help us catch any remaining issues, and we could encourage future PRs to use them to prevent new models & transformers from having issues. I add an `MLTest` trait which extends the `StreamTest` trait and overrides `createSparkSession`. An ML test suite then only needs to extend `MLTest` to use both the ML and streaming test util functions. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/WeichenXu123/spark ml_stream_test_helper Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19843.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19843 commit 072f4b9f330af61e7de07c8c2e57421448aa306b Author: WeichenXu Date: 2017-11-28T14:04:35Z init pr
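The mixin idea behind `trait MLTest extends StreamTest` can be sketched with interface default methods (Java here, since this digest offers no runnable Scala context; all names are illustrative stand-ins, not Spark's actual test classes): one base type pulls in both sets of helpers, so a suite only extends a single thing.

```java
// Sketch of the PR's design: MLTestHelpers extends StreamTestHelpers the way
// `MLTest extends StreamTest`, so a suite implementing only MLTestHelpers
// gets both the ML and the streaming test utilities. Names are hypothetical.
public class MLTestSketch {
    interface StreamTestHelpers {
        default String checkStreamOutput() { return "stream helper"; }
    }

    interface MLTestHelpers extends StreamTestHelpers {
        default String checkModelTransform() { return "ml helper"; }
    }

    // A concrete test suite needs to name only one base type.
    public static class MyTransformerSuite implements MLTestHelpers { }
}
```

The same single-inheritance convenience is what lets existing ML suites switch their base class to `MLTest` without further changes.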
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19805 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84282/ Test PASSed.
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19805 Merged build finished. Test PASSed.
[GitHub] spark pull request #19793: [SPARK-22574] [Mesos] [Submit] Check submission r...
Github user Gschiavon commented on a diff in the pull request: https://github.com/apache/spark/pull/19793#discussion_r153709924 --- Diff: core/src/test/scala/org/apache/spark/deploy/rest/SubmitRestProtocolSuite.scala --- @@ -86,6 +86,8 @@ class SubmitRestProtocolSuite extends SparkFunSuite { message.clientSparkVersion = "1.2.3" message.appResource = "honey-walnut-cherry.jar" message.mainClass = "org.apache.spark.examples.SparkPie" +message.appArgs = Array("hdfs://tmp/auth") +message.environmentVariables = Map("SPARK_HOME" -> "/test") --- End diff -- @susanxhuynh Yes, we have to make sure we don't break standalone mode. I've reviewed the StandaloneSubmitRequestServlet class, and I think we might be facing the same problem there (L.126-131): there are checks for 'appResource' and 'mainClass', but not for 'appArgs' (L.140) or 'environmentVariables' (L.143), even though they could be null since they are initialised as null. I think it's the same case; let me know what you think.
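The guard being discussed can be sketched as a pair of null checks (the class and method names below are hypothetical, not the actual StandaloneSubmitRequestServlet code): fields such as `appArgs` and `environmentVariables` are initialised as null in the protocol message, so the servlet should substitute empty defaults before dereferencing them.

```java
import java.util.Collections;
import java.util.Map;

// Hedged sketch of the null-guard under review: fall back to empty values
// for protocol-message fields that may still be null. Names are illustrative.
public class SubmitRequestGuards {
    public static String[] appArgsOrEmpty(String[] appArgs) {
        // appArgs is initialised as null, so avoid a later NullPointerException.
        return appArgs != null ? appArgs : new String[0];
    }

    public static Map<String, String> envOrEmpty(Map<String, String> environmentVariables) {
        return environmentVariables != null
                ? environmentVariables
                : Collections.emptyMap();
    }
}
```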
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19805 **[Test build #84282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/19218 @gatorsmile I tested the performance of the 'uncompressed', 'snappy', and 'gzip' compression algorithms for Parquet with input data volumes of 22MB, 220MB, and 1100MB, each run 10 times; 'snappy' performed best in several cases. The test results are as follows (TimeUnit: ms): ![default](https://user-images.githubusercontent.com/26785576/33362659-74c0cf06-d518-11e7-9907-2f353ffed37d.png)
[GitHub] spark issue #19842: [SPARK-22643][SQL] ColumnarArray should be an immutable ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19842 **[Test build #84285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84285/testReport)** for PR 19842 at commit [`aaa33dd`](https://github.com/apache/spark/commit/aaa33dde6c392ad96f4673a70480d7bbb73e0d41).
[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19842 [SPARK-22643][SQL] ColumnarArray should be an immutable view ## What changes were proposed in this pull request? To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make `ColumnarArray` an immutable view of the data, and to always create a new instance of `ColumnarArray` in `ColumnVector#getArray`. ## How was this patch tested? Locally tested with TPCDS; no performance regression is observed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark column-vector Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19842.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19842 commit aaa33dde6c392ad96f4673a70480d7bbb73e0d41 Author: Wenchen Fan Date: 2017-11-29T07:16:57Z ColumnarArray should be an immutable view
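The design change can be sketched as follows (Java, with hypothetical names rather than Spark's real `ColumnVector`/`ColumnarArray` classes): the getter returns a fresh, immutable view over the backing data on every call, instead of exposing or reusing a mutable object.

```java
// Illustrative sketch of "an immutable view of the data": the view holds only
// final fields and offers no setters, and getArray() allocates a new view per
// call, mirroring "always create a new instance in ColumnVector#getArray".
public class ColumnDemo {
    private final int[] data;

    public ColumnDemo(int[] data) { this.data = data; }

    public static final class IntArrayView {
        private final int[] backing;
        private final int offset;
        private final int length;

        IntArrayView(int[] backing, int offset, int length) {
            this.backing = backing;
            this.offset = offset;
            this.length = length;
        }

        public int get(int i) { return backing[offset + i]; }
        public int length() { return length; }
    }

    // A new instance each time: callers can hold on to a view safely.
    public IntArrayView getArray(int offset, int length) {
        return new IntArrayView(data, offset, length);
    }
}
```

The cost is one small allocation per call, which is why the PR description reports checking TPCDS for regressions.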
[GitHub] spark issue #19842: [SPARK-22643][SQL] ColumnarArray should be an immutable ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19842 cc @michal-databricks @hvanhovell @kiszk @gatorsmile
[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19812 Did this failure ("For some reason, all of the 3 executors failed.") happen during task running or before task submission? Besides, if you're running on YARN, YARN will bring up new executors to meet the requirement, and `ExecutorAllocationManager` will also be notified of executor lost/register events. Can you please tell us how to reproduce your scenario?
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Merged build finished. Test PASSed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19750 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84281/ Test PASSed.
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #84281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84281/testReport)** for PR 19750 at commit [`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...
Github user liutang123 commented on the issue: https://github.com/apache/spark/pull/19812 Hi @jerryshao , I modified the info of this PR.
[GitHub] spark pull request #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - B...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19468
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19468 For future pull requests, can you create subtasks under https://issues.apache.org/jira/browse/SPARK-18278 ?
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19468 Thanks - merging in master!
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19840 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84280/ Test FAILed.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19840 Merged build finished. Test FAILed.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19840 **[Test build #84280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84280/testReport)** for PR 19840 at commit [`8ff5663`](https://github.com/apache/spark/commit/8ff5663fe9a32eae79c8ee6bc310409170a8da64). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19825: [SPARK-22615][SQL] Handle more cases in PropagateEmptyRe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19825 **[Test build #84284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84284/testReport)** for PR 19825 at commit [`c359fe3`](https://github.com/apache/spark/commit/c359fe33de8425f932c0d447de358e98c0d3eb84).
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Can one of the admins verify this patch?
[GitHub] spark pull request #19841: [SPARK-22642][SQL] the createdTempDir will not be...
GitHub user zuotingbing opened a pull request: https://github.com/apache/spark/pull/19841 [SPARK-22642][SQL] the createdTempDir will not be deleted if an exception occurs ## What changes were proposed in this pull request? We found that staging directories are sometimes not dropped in our production environment. The createdTempDir will not be deleted if an exception occurs, so we should delete createdTempDir in a finally block. Refer to SPARK-18703. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zuotingbing/spark SPARK-stagedir Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19841.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19841 commit 99d1fd74b7f25a7ddbda4dd01a8b4c03b3da778a Author: zuotingbing Date: 2017-11-29T06:22:20Z [SPARK-22642][SQL] the createdTempDir will not be deleted if an exception occurs
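The fix described in the PR summary boils down to a try/finally pattern, which can be sketched like this (a minimal stand-in, not Spark's actual insert logic; `runJob` and the class name are hypothetical):

```java
import java.io.File;

// Sketch of "delete createdTempDir in finally": the staging directory is
// removed on both the success path and the exception path, so a failed job
// no longer leaks it.
public class StagingDirCleanup {
    public static void runWithCleanup(File createdTempDir, Runnable runJob) {
        try {
            runJob.run();
        } finally {
            // Runs even when runJob throws, which is the point of the change.
            if (createdTempDir.exists()) {
                createdTempDir.delete();
            }
        }
    }
}
```

Moving the cleanup out of the success-only path is exactly why the directory was previously left behind whenever an exception occurred mid-write.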
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Merged build finished. Test FAILed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19823 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84283/ Test FAILed.
[GitHub] spark issue #19821: [WIP][SPARK-22608][SQL] add new API to CodeGeneration.sp...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19821 Sure, I have resolved the conflict in my environment. I will commit soon.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823 **[Test build #84283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84283/testReport)** for PR 19823 at commit [`a60843c`](https://github.com/apache/spark/commit/a60843c12f197282ae046a96772cf12c548f43a9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84278/ Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84278 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84278/testReport)** for PR 19651 at commit [`e13dfa3`](https://github.com/apache/spark/commit/e13dfa3600b1a1be3d659eb79ceb73b4078067f8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19834: [SPARK-22585][Core] Path in addJar is not url encoded
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19834 LGTM.
[GitHub] spark issue #18714: [SPARK-20236][SQL] runtime partition overwrite
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/18714 ping, very interested in this.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693534 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) + + /** + * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be + * used to truncate the logical plan of this Dataset, which is especially useful in iterative + * algorithms where the plan may grow exponentially. Local checkpoints are written to executor + * storage and despite potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true) + + /** + * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate + * the logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true) --- End diff -- ditto.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84279/ Test FAILed.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693511 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) --- End diff -- We don't need the default value for `eager` here.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153693403 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -518,13 +518,12 @@ class Dataset[T] private[sql]( * the logical plan of this Dataset, which is especially useful in iterative algorithms where the * plan may grow exponentially. It will be saved to files inside the checkpoint * directory set with `SparkContext#setCheckpointDir`. - --- End diff -- Maybe we should revert this back?
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19805#discussion_r153694085 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -537,9 +536,55 @@ class Dataset[T] private[sql]( */ @Experimental @InterfaceStability.Evolving - def checkpoint(eager: Boolean): Dataset[T] = { + def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager) + + /** + * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be + * used to truncate the logical plan of this Dataset, which is especially useful in iterative + * algorithms where the plan may grow exponentially. Local checkpoints are written to executor + * storage and despite potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true) + + /** + * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate + * the logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true) + + /** + * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the + * logical plan of this Dataset, which is especially useful in iterative algorithms where the + * plan may grow exponentially. + * By default reliable checkpoints are created and saved to files inside the checkpoint + * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint + * is performed instead. Local checkpoints are written to executor storage and despite + * potentially faster they are unreliable and may compromise job completion. + * + * @group basic + * @since 2.3.0 + */ + @Experimental + @InterfaceStability.Evolving + private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = { --- End diff -- We should make this `private`?
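The API shape the reviewer is pointing at in these comments can be sketched with plain overloads (Java here for a self-contained example; the real API is Scala, and `CheckpointDemo` is a hypothetical stand-in for `Dataset`): with an explicit no-arg overload delegating to the one-argument version, a default value on `eager` is redundant, and the shared `_checkpoint` helper can stay private.

```java
// Sketch of the overload structure under review: public no-arg overloads
// supply the "eager = true" default, and the shared helper is private.
public class CheckpointDemo {
    public String checkpoint() { return checkpoint(true); }
    public String checkpoint(boolean eager) { return doCheckpoint(eager, false); }

    public String localCheckpoint() { return localCheckpoint(true); }
    public String localCheckpoint(boolean eager) { return doCheckpoint(eager, true); }

    // Kept private, matching the comment "We should make this `private`?".
    private String doCheckpoint(boolean eager, boolean local) {
        return (local ? "local" : "reliable") + (eager ? "-eager" : "-lazy");
    }
}
```

Keeping defaults in the overloads rather than in parameter defaults also avoids the ambiguity of offering both a no-arg method and a defaulted parameter for the same call site.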
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test FAILed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84277/testReport)** for PR 19651 at commit [`decaa50`](https://github.com/apache/spark/commit/decaa50be159ecf6b87dfb68cbe44bec91eaea9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` trait Converter ` * ` trait OrcDataUpdater ` * ` final class RowUpdater(row: InternalRow, i: Int) extends OrcDataUpdater ` * ` final class ArrayDataUpdater(updater: OrcDataUpdater) extends OrcDataUpdater ` * ` final class MapDataUpdater(`
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84277/ Test PASSed.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84279 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84279/testReport)** for PR 19651 at commit [`fdab6a7`](https://github.com/apache/spark/commit/fdab6a7eb3acc63f78889fd31f8f078fca66aa0f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19651 Merged build finished. Test PASSed.
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19823 **[Test build #84283 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84283/testReport)** for PR 19823 at commit [`a60843c`](https://github.com/apache/spark/commit/a60843c12f197282ae046a96772cf12c548f43a9).
[GitHub] spark issue #19823: [SPARK-22601][SQL] Data load is getting displayed succes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19823 retest this please
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153693386 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- fixed. thanks
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user sujith71955 commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153693359 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- updated it. In my local build it was working fine. Thanks for the feedback
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19805 @ferdonline Could you file a JIRA issue and add the id to the title like `[SPARK-xxx][PYTHON][SQL] ...`?
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19805 **[Test build #84282 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84282/testReport)** for PR 19805 at commit [`59b5562`](https://github.com/apache/spark/commit/59b5562d1cfe738dc991ef17afad71abd891195d).
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19805 ok to test
[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19792#discussion_r153692220 --- Diff: python/pyspark/sql/types.py --- @@ -1108,19 +1109,23 @@ def _has_nulltype(dt): return isinstance(dt, NullType) -def _merge_type(a, b): +def _merge_type(a, b, path=''): --- End diff -- Well, can you follow the #18521 format for now? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
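The `_merge_type(a, b, path='')` change under review threads a field path through the recursive merge so that a type conflict can report *where* it occurred. A minimal sketch of that idea, not PySpark's actual implementation (schemas here are simplified to dicts and type-name strings):

```python
def merge_type(a, b, path=""):
    """Merge two schema descriptions; on conflict, raise with the field path.

    A schema is either a primitive type name (str) or a dict mapping
    field name -> schema, standing in for a struct type.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, b_type in b.items():
            # Extend the path as we descend into nested fields.
            child = f"{path}.{name}" if path else name
            merged[name] = merge_type(a[name], b_type, child) if name in a else b_type
        return merged
    if a == b:
        return a
    # The accumulated path makes the error message actionable.
    raise TypeError(f"Cannot merge {a!r} and {b!r} at field '{path or '<root>'}'")
```

With this, merging `{"x": {"y": "int"}}` against `{"x": {"y": "string"}}` fails with an error naming `x.y` instead of just the two incompatible types.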
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19833 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19833 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84276/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19833 **[Test build #84276 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84276/testReport)** for PR 19833 at commit [`be1c62e`](https://github.com/apache/spark/commit/be1c62e5b16a8b4a7bb177be5a5042c9006d2903). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups and tes...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/19835 the changes look good to me, the extra verification logic for arguments is a great addition --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19750 **[Test build #84281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84281/testReport)** for PR 19750 at commit [`81b3f3d`](https://github.com/apache/spark/commit/81b3f3dd223785e02ad2df2a6e40046a76539681). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/19835#discussion_r153689403 --- Diff: python/pyspark/ml/image.py --- @@ -146,7 +163,12 @@ def toImage(self, array, origin=""): mode = ocvTypes["CV_8UC4"] else: raise ValueError("Invalid number of channels") -data = bytearray(array.astype(dtype=np.uint8).ravel()) + +# Running `bytearray(numpy.array([1]))` fails in specific Python versions +# with a specific Numpy version, for example in Python 3.6.0 and NumPy 1.13.3. +# Here, it avoids it by converting it to bytes. +data = bytearray(array.astype(dtype=np.uint8).ravel().tobytes()) --- End diff -- strange, but the comment explains the issue well and I think this is a good workaround --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
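The workaround in the diff above converts the NumPy array to bytes before wrapping it in a `bytearray`, because `bytearray(numpy_array)` fails on some Python/NumPy combinations (e.g. Python 3.6.0 with NumPy 1.13.3). The same pattern, sketched with the stdlib `array` module standing in for a NumPy `uint8` array:

```python
import array

# Unsigned 8-bit values, as image channel data would be after
# array.astype(dtype=np.uint8).ravel() in the PR.
pixels = array.array("B", [0, 127, 255])

# Instead of bytearray(pixels), which relies on buffer-protocol
# behavior that varied across interpreter/library versions, go through
# an explicit bytes conversion first:
data = bytearray(pixels.tobytes())
```

The explicit `.tobytes()` step makes the intent unambiguous and sidesteps the version-specific `bytearray(...)` constructor behavior.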
[GitHub] spark pull request #19835: [SPARK-21866][ML][PYTHON][FOLLOWUP] Few cleanups ...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/19835#discussion_r153689356 --- Diff: python/pyspark/ml/tests.py --- @@ -1836,6 +1836,24 @@ def test_read_images(self): self.assertEqual(ImageSchema.imageFields, expected) self.assertEqual(ImageSchema.undefinedImageType, "Undefined") +with QuietTest(self.sc): --- End diff -- nice tests! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19750: [SPARK-20650][core] Remove JobProgressListener.
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19750 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19833: [SPARK-22605][SQL] SQL write job should also set ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19833#discussion_r153687993 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala --- @@ -106,6 +105,13 @@ class BasicWriteTaskStatsTracker(hadoopConf: Configuration) override def getFinalStats(): WriteTaskStats = { statCurrentFile() + +// Reports bytesWritten and recordsWritten to the Spark output metrics. +Option(TaskContext.get()).map(_.taskMetrics().outputMetrics).foreach { outputMetrics => --- End diff -- Which test suite covers this newly added logic? Previously, BasicWriteTaskStatsTrackerSuite seems to fail (8 of 10 test cases) because TaskContext is absent in the test suite. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec in execut...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19840 **[Test build #84280 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84280/testReport)** for PR 19840 at commit [`8ff5663`](https://github.com/apache/spark/commit/8ff5663fe9a32eae79c8ee6bc310409170a8da64). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19840: [SPARK-22640][PYSPARK][YARN]switch python exec in...
GitHub user yaooqinn opened a pull request: https://github.com/apache/spark/pull/19840 [SPARK-22640][PYSPARK][YARN] Switch python exec in executor side ## What changes were proposed in this pull request? ``` PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python \ bin/spark-submit --master yarn --deploy-mode client \ --archives ~/anaconda3/envs/py3.zip \ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ --conf spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python \ /home/hadoop/data/apache-spark/spark-2.1.2-bin-hadoop2.7/examples/src/main/python/mllib/correlations_example.py ``` In the case above, I created a Python environment, delivered it via `--archives`, then accessed it on the executor node via `spark.executorEnv.PYSPARK_PYTHON`. But the executor seemed to use `PYSPARK_PYTHON=~/anaconda3/envs/py3/bin/python` instead of `spark.executorEnv.PYSPARK_PYTHON=py3.zip/py3/bin/python`, so the application ended with an IOException. This PR aims to switch the Python executable when the user specifies it. ## How was this patch tested? Manually verified with the case above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yaooqinn/spark SPARK-22640 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19840.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19840 commit 8ff5663fe9a32eae79c8ee6bc310409170a8da64 Author: Kent Yao, Date: 2017-11-29T03:26:47Z switch python exec in executor side --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
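The resolution order SPARK-22640 argues for can be sketched as follows: a `spark.executorEnv.PYSPARK_PYTHON` setting shipped with the job should win over whatever `PYSPARK_PYTHON` happens to be set on the executor host. This is an illustrative sketch of that precedence, not Spark's actual resolution code:

```python
def resolve_executor_python(spark_conf, host_env):
    """Pick the Python executable an executor should launch.

    Precedence (per the PR's intent): job-level executorEnv setting,
    then the host's PYSPARK_PYTHON, then a plain `python` fallback.
    """
    return (
        spark_conf.get("spark.executorEnv.PYSPARK_PYTHON")
        or host_env.get("PYSPARK_PYTHON")
        or "python"
    )
```

With the submit command from the PR description, the archived interpreter (`py3.zip/py3/bin/python`) would be chosen even though the host exports its own `PYSPARK_PYTHON`.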
[GitHub] spark pull request #19836: [SPARK-22618][CORE] Catch exception in removeRDD ...
Github user brad-kaiser commented on a diff in the pull request: https://github.com/apache/spark/pull/19836#discussion_r153686556 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala --- @@ -159,11 +160,18 @@ class BlockManagerMasterEndpoint( // Ask the slaves to remove the RDD, and put the result in a sequence of Futures. // The dispatcher is used as an implicit argument into the Future sequence construction. val removeMsg = RemoveRdd(rddId) -Future.sequence( - blockManagerInfo.values.map { bm => -bm.slaveEndpoint.ask[Int](removeMsg) - }.toSeq -) + +val handleRemoveRddException: PartialFunction[Throwable, Int] = { + case e: IOException => +logError(s"Error trying to remove rdd $rddId", e) --- End diff -- Thanks for looking at my change. I changed the log to warning and "rdd" to "RDD". I had pulled the partial function out because I felt the expression was getting too deeply nested and hard to read. I certainly don't have to do that though. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
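The fix under review recovers from a single peer's `IOException` (logging it and contributing `0` removed blocks) instead of letting one failure break the whole `Future.sequence`. A sketch of the same idea, with Python's `concurrent.futures` standing in for Spark's RPC futures:

```python
import concurrent.futures
import logging


def remove_rdd_everywhere(endpoints, rdd_id):
    """Ask every endpoint to remove an RDD; tolerate per-endpoint I/O errors."""

    def safe_remove(endpoint):
        try:
            return endpoint(rdd_id)  # number of blocks removed on this peer
        except IOError as e:
            # Recover, like the PartialFunction[Throwable, Int] in the diff.
            logging.warning("Error trying to remove RDD %s: %s", rdd_id, e)
            return 0

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # pool.map preserves input order, like Future.sequence over a Seq.
        return list(pool.map(safe_remove, endpoints))
```

One flaky block manager no longer fails the entire removal; the caller still learns how many blocks each healthy peer removed.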
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153686107 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column value falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- Shall we unify all names to `bin`/`bins` in code and comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
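The helper under review finds the first equi-height histogram bin containing a value. A runnable sketch of that lookup, assuming bins are described by a sorted list of boundaries (Spark's `Histogram` stores per-bin lo/hi endpoints; the boundary-list representation here is a simplification):

```python
import bisect


def find_first_bin_for_value(value, bin_bounds):
    """Return the 0-based index of the first bin containing `value`.

    bin_bounds: sorted boundaries [b0, b1, ..., bn] defining n bins
    [b0, b1], [b1, b2], ..., [b(n-1), bn]. A value equal to a shared
    boundary belongs to the *first* bin ending at that boundary.
    """
    if value < bin_bounds[0] or value > bin_bounds[-1]:
        raise ValueError(f"{value} is outside the histogram range")
    # bisect_left finds the first boundary >= value; stepping back one
    # gives the first bin whose range includes the value.
    i = bisect.bisect_left(bin_bounds, value)
    return max(i - 1, 0)
```

For boundaries `[0, 10, 20, 30]`, the value 10 falls into bin 0 (the first of the two bins sharing that boundary), which is the "first bin" semantics the Scaladoc describes.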
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84279 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84279/testReport)** for PR 19651 at commit [`fdab6a7`](https://github.com/apache/spark/commit/fdab6a7eb3acc63f78889fd31f8f078fca66aa0f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19839: SPARK-22373 Bump Janino dependency version to fix thread...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19839 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19839: SPARK-22373 Bump Janino dependency version to fix...
GitHub user Victsm opened a pull request: https://github.com/apache/spark/pull/19839 SPARK-22373 Bump Janino dependency version to fix thread safety issue… … with Janino when compiling generated code. ## What changes were proposed in this pull request? Bump up the Janino dependency version to fix a thread safety issue when compiling generated code. ## How was this patch tested? Check https://issues.apache.org/jira/browse/SPARK-22373 for details. Converted part of the code in CodeGenerator into a standalone application, so the issue can be consistently reproduced locally. Verified that changing the Janino dependency version resolved this issue. You can merge this pull request into a Git repository by running: $ git pull https://github.com/Victsm/spark SPARK-22373 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19839.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19839 commit 5f54a89fddacad5f041fb35e46b7faba27c4789d Author: Min Shen, Date: 2017-11-29T03:14:51Z SPARK-22373 Bump Janino dependency version to fix thread safety issue with Janino when compiling generated code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84278 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84278/testReport)** for PR 19651 at commit [`e13dfa3`](https://github.com/apache/spark/commit/e13dfa3600b1a1be3d659eb79ceb73b4078067f8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153683384 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val currentRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val fieldConverters: Array[Converter] = requiredSchema.zipWithIndex.map { +case (f, ordinal) => + if (missingColumnNames.contains(f.name)) { +null + } else { +newConverter(f.dataType, new RowUpdater(currentRow, ordinal)) + } + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) +} + } +}.toArray + +var i = 0 +while (i < length) { + val writable = fieldRefs(i) + if (writable == null) { +currentRow.setNullAt(i) + } else { +fieldConverters(i).set(writable) + } + i += 1 +} +currentRow + } + + private[this] def newConverter(dataType: DataType, updater: OrcDataUpdater): Converter = +dataType match { + case NullType => +new Converter { + override def set(value: Any): Unit = updater.setNullAt() +} + + case BooleanType => +new Converter { + override def set(value: Any): Unit = 
+updater.setBoolean(value.asInstanceOf[BooleanWritable].get) +} + + case ByteType => +new Converter { + override def set(value: Any): Unit = updater.setByte(value.asInstanceOf[ByteWritable].get) +} + + case ShortType => +new Converter { + override def set(value: Any): Unit = +updater.setShort(value.asInstanceOf[ShortWritable].get) +} + + case IntegerType => +new Converter { + override def set(value: Any): Unit = updater.setInt(value.asInstanceOf[IntWritable].get) +} + + case LongType => +new Converter { + override def set(value: Any): Unit = updater.setLong(value.asInstanceOf[LongWritable].get) +} + + case FloatType => +new Converter { + override def set(value: Any): Unit = +updater.setFloat(value.asInstanceOf[FloatWritable].get) +} + + case DoubleType => +new Converter { + override def set(value: Any): Unit = +updater.setDouble(value.asInstanceOf[DoubleWritable].get) +} + + case StringType => +new Converter {
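The (truncated) `OrcDeserializer` diff above builds one converter per required field up front, with `null` entries for missing columns, and then per row either nulls the slot out or applies the prebuilt converter. A small Python sketch of that shape, with plain functions standing in for the Scala `Converter` objects:

```python
def build_deserializer(required_fields, missing, converters):
    """Precompute per-field converters once; reuse them for every row.

    required_fields: ordered field names to produce.
    missing: set of field names absent from the data (emit None).
    converters: map of field name -> one-argument conversion function.
    """
    # Analogous to the fieldConverters array: None marks a missing column.
    field_converters = [
        None if name in missing else converters[name] for name in required_fields
    ]

    def deserialize(record):
        row = []
        for name, conv in zip(required_fields, field_converters):
            # Mirrors currentRow.setNullAt(i) vs fieldConverters(i).set(...).
            row.append(None if conv is None else conv(record[name]))
        return row

    return deserialize
```

The design point is the same as in the diff: the type dispatch (`dataType match { ... }`) happens once at construction, so the per-row loop is just an array walk.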
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153682330 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,304 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val currentRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val fieldConverters: Array[Converter] = requiredSchema.zipWithIndex.map { +case (f, ordinal) => + if (missingColumnNames.contains(f.name)) { +null + } else { +newConverter(f.dataType, new RowUpdater(currentRow, ordinal)) + } + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) +} + } +}.toArray + +var i = 0 +while (i < length) { + val writable = fieldRefs(i) + if (writable == null) { +currentRow.setNullAt(i) + } else { +fieldConverters(i).set(writable) + } + i += 1 +} +currentRow + } + + private[this] def newConverter(dataType: DataType, updater: OrcDataUpdater): Converter = +dataType match { + case NullType => +new Converter { + override def set(value: Any): Unit = updater.setNullAt() +} + + case BooleanType => +new Converter { + override def set(value: Any): Unit = 
+updater.setBoolean(value.asInstanceOf[BooleanWritable].get) +} + + case ByteType => +new Converter { + override def set(value: Any): Unit = updater.setByte(value.asInstanceOf[ByteWritable].get) +} + + case ShortType => +new Converter { + override def set(value: Any): Unit = +updater.setShort(value.asInstanceOf[ShortWritable].get) +} + + case IntegerType => +new Converter { + override def set(value: Any): Unit = updater.setInt(value.asInstanceOf[IntWritable].get) +} + + case LongType => +new Converter { + override def set(value: Any): Unit = updater.setLong(value.asInstanceOf[LongWritable].get) +} + + case FloatType => +new Converter { + override def set(value: Any): Unit = +updater.setFloat(value.asInstanceOf[FloatWritable].get) +} + + case DoubleType => +new Converter { + override def set(value: Any): Unit = +updater.setDouble(value.asInstanceOf[DoubleWritable].get) +} + + case StringType => +new Converter {
[GitHub] spark pull request #19129: [SPARK-13656][SQL] Delete spark.sql.parquet.cache...
Github user zzl1787 commented on a diff in the pull request: https://github.com/apache/spark/pull/19129#discussion_r153682329 --- Diff: docs/sql-programming-guide.md --- @@ -1587,6 +1580,10 @@ options. Note that this is different from the Hive behavior. - As a result, `DROP TABLE` statements on those tables will not remove the data. + - From Spark 2.0.1, `spark.sql.parquet.cacheMetadata` is no longer used. See + [SPARK-16321](https://issues.apache.org/jira/browse/SPARK-16321) and + [SPARK-15639](https://issues.apache.org/jira/browse/SPARK-15639) for details. --- End diff -- @dongjoon-hyun Ok, got it, and thank you. I finally found the parameter to control this: `spark.sql.filesourceTableRelationCacheSize = 0` This will disable the metadata cache. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new O...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19651#discussion_r153682051 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.orc + +import scala.collection.JavaConverters._ +import scala.collection.mutable.ArrayBuffer + +import org.apache.hadoop.io._ +import org.apache.orc.mapred.{OrcList, OrcMap, OrcStruct, OrcTimestamp} +import org.apache.orc.storage.serde2.io.{DateWritable, HiveDecimalWritable} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.execution.datasources.orc.OrcUtils.withNullSafe +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +private[orc] class OrcDeserializer( +dataSchema: StructType, +requiredSchema: StructType, +missingColumnNames: Seq[String]) { + + private[this] val mutableRow = new SpecificInternalRow(requiredSchema.map(_.dataType)) + + private[this] val length = requiredSchema.length + + private[this] val unwrappers = requiredSchema.map { f => +if (missingColumnNames.contains(f.name)) { + (value: Any, row: InternalRow, ordinal: Int) => row.setNullAt(ordinal) +} else { + unwrapperFor(f.dataType) +} + }.toArray + + def deserialize(orcStruct: OrcStruct): InternalRow = { +val names = orcStruct.getSchema.getFieldNames +val fieldRefs = requiredSchema.map { f => + val name = f.name + if (missingColumnNames.contains(name)) { +null + } else { +if (names.contains(name)) { + orcStruct.getFieldValue(name) +} else { + orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name)) --- End diff -- `OrcStruct` will raise Exception here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153680715 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -153,6 +151,27 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { // --- BroadcastHashJoin + case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) +if canBuildRight(joinType) && canBuildLeft(joinType) + && left.stats.hints.broadcast && right.stats.hints.broadcast => --- End diff -- I think we can create new methods for it ``` def shouldBuildLeft(joinType: JoinType, left: LogicalPlan, right: LogicalPlan): Boolean = { if (left.stats.hints.broadcast) { if (canBuildRight(joinType) && right.stats.hints.broadcast) { // if both sides have the broadcast hint, only broadcast the left side if its estimated physical size is smaller than the right side left.stats.sizeInBytes <= right.stats.sizeInBytes } else { // if only the left side has the broadcast hint, broadcast the left side true } } else { if (canBuildRight(joinType) && right.stats.hints.broadcast) { // if only the right side has the broadcast hint, do not broadcast the left side false } else { // if neither side has a broadcast hint, only broadcast the left side if its estimated physical size is smaller than the threshold and smaller than the right side canBroadcast(left) && left.stats.sizeInBytes <= right.stats.sizeInBytes } } } def shouldBuildRight... ``` and use it like ``` case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right) if canBuildRight(joinType) && shouldBuildRight(joinType, left, right) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
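The Scala pseudocode in the review comment above is a decision table over hints and estimated sizes. The same table as a runnable sketch (sizes in bytes; function name and parameters are illustrative, not Spark's actual method signatures):

```python
def should_build_left(left_size, right_size, left_hint, right_hint,
                      can_build_right, threshold):
    """Decide whether the left side should be the broadcast build side."""
    if left_hint:
        if can_build_right and right_hint:
            # Both sides hinted: broadcast the smaller estimated side.
            return left_size <= right_size
        return True  # only the left side is hinted
    if can_build_right and right_hint:
        return False  # only the right side is hinted
    # No hints: left must be under the auto-broadcast threshold
    # and be the smaller of the two sides.
    return left_size <= threshold and left_size <= right_size
```

A symmetric `should_build_right` would mirror this, which is exactly the `shouldBuildLeft`/`shouldBuildRight` pair the comment proposes.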
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153679188 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -91,10 +91,10 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { * predicates can be evaluated by matching join keys. If found, Join implementations are chosen * with the following precedence: * - * - Broadcast: if one side of the join has an estimated physical size that is smaller than the - * user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold - * or if that side has an explicit broadcast hint (e.g. the user applied the - * [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side + * - Broadcast: if one side of the join has an explicit broadcast hint (e.g. the user applied the --- End diff -- ``` Broadcast: We prefer to broadcast the join side with an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame). If both sides have the broadcast hint, we prefer to broadcast the side with a smaller estimated physical size. If neither side has the broadcast hint, we only broadcast the join side if its estimated physical size is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold. ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-18278] [Submission] Spark on Kubernetes - ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r153678912 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.util.concurrent.TimeUnit + +import org.apache.spark.{SPARK_VERSION => sparkVersion} +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.ConfigBuilder +import org.apache.spark.network.util.ByteUnit + +private[spark] object Config extends Logging { + + val KUBERNETES_NAMESPACE = +ConfigBuilder("spark.kubernetes.namespace") + .doc("The namespace that will be used for running the driver and executor pods. When using " + +"spark-submit in cluster mode, this can also be passed to spark-submit via the " + +"--kubernetes-namespace command line argument.") + .stringConf + .createWithDefault("default") + + val DRIVER_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.driver.docker.image") + .doc("Docker image to use for the driver. 
Specify this using the standard Docker tag format.") + .stringConf + .createWithDefault(s"spark-driver:$sparkVersion") + + val EXECUTOR_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.executor.docker.image") + .doc("Docker image to use for the executors. Specify this using the standard Docker tag " + +"format.") + .stringConf + .createWithDefault(s"spark-executor:$sparkVersion") + + val DOCKER_IMAGE_PULL_POLICY = +ConfigBuilder("spark.kubernetes.docker.image.pullPolicy") + .doc("Docker image pull policy when pulling any docker image in Kubernetes integration") + .stringConf + .createWithDefault("IfNotPresent") + + + val KUBERNETES_AUTH_DRIVER_CONF_PREFIX = + "spark.kubernetes.authenticate.driver" + val KUBERNETES_AUTH_DRIVER_MOUNTED_CONF_PREFIX = + "spark.kubernetes.authenticate.driver.mounted" + val OAUTH_TOKEN_CONF_SUFFIX = "oauthToken" + val OAUTH_TOKEN_FILE_CONF_SUFFIX = "oauthTokenFile" + val CLIENT_KEY_FILE_CONF_SUFFIX = "clientKeyFile" + val CLIENT_CERT_FILE_CONF_SUFFIX = "clientCertFile" + val CA_CERT_FILE_CONF_SUFFIX = "caCertFile" + + val KUBERNETES_SERVICE_ACCOUNT_NAME = + ConfigBuilder(s"$KUBERNETES_AUTH_DRIVER_CONF_PREFIX.serviceAccountName") + .doc("Service account that is used when running the driver pod. The driver pod uses " + +"this service account when requesting executor pods from the API server. 
If specific " + +"credentials are given for the driver pod to use, the driver will favor " + +"using those credentials instead.") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.driver.limit.cores") + .doc("Specify the hard cpu limit for the driver pod") + .stringConf + .createOptional + + val KUBERNETES_EXECUTOR_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.executor.limit.cores") + .doc("Specify the hard cpu limit for a single executor pod") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_MEMORY_OVERHEAD = +ConfigBuilder("spark.kubernetes.driver.memoryOverhead") + .doc("The amount of off-heap memory (in megabytes) to be allocated for the driver and the " + +"driver submission server. This is memory that accounts for things like VM overheads, " + +"interned strings, other native overheads, etc. This tends to grow with the driver's " + +"memory size (typically 6-10%).") + .bytesConf(ByteUnit.MiB) + .createOptional + + // Note that while we set a default for this when we start up the + // scheduler, the specific default value is
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153678875 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- nit: insert an empty line before the test
[GitHub] spark issue #19651: [SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19651 **[Test build #84277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84277/testReport)** for PR 19651 at commit [`decaa50`](https://github.com/apache/spark/commit/decaa50be159ecf6b87dfb68cbe44bec91eaea9e).
[GitHub] spark pull request #19823: [SPARK-22601][SQL] Data load is getting displayed...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19823#discussion_r153678491 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala --- @@ -2392,5 +2392,13 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } } } +test("load command invalid path validation ") { --- End diff -- @sujith71955 according to Jenkins, there's a trailing whitespace at the end of the line
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831 Besides, if the size stats `totalSize` or `rawDataSize` are wrong, the problem also exists whether CBO is enabled or not. We should reflect that in the title too.
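To see why wrong size stats matter even without CBO: the planner's broadcast decision compares an estimated plan size against `spark.sql.autoBroadcastJoinThreshold`, so an understated `totalSize` can flip the join strategy. A minimal sketch of that decision rule (the function name and simplified logic are illustrative, not Spark's actual planner code):

```python
# Illustrative sketch of a size-threshold broadcast decision.
# A table whose stored size stat wrongly reports a few kilobytes would be
# broadcast even if its real data is huge.

DEFAULT_AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, Spark's default

def choose_join_strategy(left_size_bytes, right_size_bytes,
                         threshold=DEFAULT_AUTO_BROADCAST_THRESHOLD):
    """Pick a join strategy from estimated plan sizes (in bytes)."""
    if right_size_bytes <= threshold:
        return "broadcast-right"   # right side small enough to ship to all executors
    if left_size_bytes <= threshold:
        return "broadcast-left"
    return "sort-merge"            # both sides too large: fall back

print(choose_join_strategy(500 * 1024 * 1024, 4 * 1024))          # broadcast-right
print(choose_join_strategy(500 * 1024 * 1024, 50 * 1024 * 1024))  # sort-merge
```

The sketch shows why the bug is independent of CBO: only the size estimate feeds the threshold check, so a corrupted stat misleads the planner either way.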
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153677941 --- Diff: docs/sql-programming-guide.md --- @@ -1492,6 +1492,61 @@ that these options will be deprecated in future release as more optimizations ar +## Broadcast Hint for SQL Queries + +Broadcast hint is a way for users to manually annotate a query and suggest to the query optimizer the join method. +It is very useful when the query optimizer cannot make optimal decision with respect to join methods +due to conservativeness or the lack of proper statistics. +Spark Broadcast Hint has higher priority than autoBroadcastJoin mechanism, examples: + + + + + +{% highlight scala %} +val src = sql("SELECT * FROM src") +broadcast(src).join(recordsDF, Seq("key")).show() --- End diff -- a more standard way: ``` import org.apache.spark.sql.functions.broadcast broadcast(spark.table("src")).join(spark.table("records"), Seq("key")).show() ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19831: [SPARK-22626][SQL] Wrong Hive table statistics may trigg...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19831 BTW, the case here is not about join reorder; it's actually about the broadcast decision. Could you update the title of this PR?
[GitHub] spark pull request #19714: [SPARK-22489][SQL] Shouldn't change broadcast joi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19714#discussion_r153677668 --- Diff: docs/sql-programming-guide.md --- @@ -1492,6 +1492,61 @@ that these options will be deprecated in future release as more optimizations ar +## Broadcast Hint for SQL Queries + +Broadcast hint is a way for users to manually annotate a query and suggest to the query optimizer the join method. +It is very useful when the query optimizer cannot make optimal decision with respect to join methods +due to conservativeness or the lack of proper statistics. +Spark Broadcast Hint has higher priority than autoBroadcastJoin mechanism, examples: + + + + + +{% highlight scala %} +val src = sql("SELECT * FROM src") +broadcast(src).join(recordsDF, Seq("key")).show() +{% endhighlight %} + + + + + +{% highlight java %} +Dataset src = sql("SELECT * FROM src"); +broadcast(src).join(recordsDF, Seq("key")).show(); +{% endhighlight %} + + + + + +{% highlight python %} +src = spark.sql("SELECT * FROM src") +recordsDF.join(broadcast(src), "key").show() +{% endhighlight %} + + + + + +{% highlight r %} +src <- sql("SELECT COUNT(*) FROM src") +showDF(join(broadcast(src), recordsDF, src$key == recordsDF$key))) +{% endhighlight %} + + + + + +{% highlight sql %} +SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key +{% endhighlight %} + + + +(Note that we accept `BROADCAST`, `BROADCASTJOIN` and `MAPJOIN` for broadcast hint) --- End diff -- shall we treat it as a comment on the SQL example? ``` --we accept BROADCAST, BROADCASTJOIN and MAPJOIN for broadcast hint SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r153677676 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl( // Note that this statistics could be overridden by Spark's statistics if that's available. val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) - val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0) + val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0) --- End diff -- The root problem is that user can set "wrong" table properties. So if we want to prevent using wrong stats, we need to detect changes in properties. Otherwise your case can't be avoided. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19831: [SPARK-22626][SQL] Wrong Hive table statistics ma...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19831#discussion_r153677300 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala --- @@ -418,7 +418,7 @@ private[hive] class HiveClientImpl( // Note that this statistics could be overridden by Spark's statistics if that's available. val totalSize = properties.get(StatsSetupConst.TOTAL_SIZE).map(BigInt(_)) val rawDataSize = properties.get(StatsSetupConst.RAW_DATA_SIZE).map(BigInt(_)) - val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ >= 0) + val rowCount = properties.get(StatsSetupConst.ROW_COUNT).map(BigInt(_)).filter(_ > 0) --- End diff -- Hive has a flag called `StatsSetupConst.COLUMN_STATS_ACCURATE`. If I remember correctly, this flag will become **false** if user changes table properties or table data. Can you check if the flag exists in your case? If so, we can use the flag to decide whether to read statistics from Hive. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
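The suggestion above is to trust Hive-sourced stats only when Hive's `COLUMN_STATS_ACCURATE` flag says they are accurate. A rough sketch of that gating, working on a plain table-properties dict (the function name is hypothetical; newer Hive versions store the flag as a JSON blob like `{"BASIC_STATS":"true"}`, older ones as a bare `"true"`/`"false"` string):

```python
import json

def read_hive_stats(properties):
    """Return (total_size, row_count) from a Hive table-properties dict,
    or (None, None) when the accuracy flag is absent or false."""
    flag = properties.get("COLUMN_STATS_ACCURATE", "false")
    try:
        # Newer Hive: a JSON object such as {"BASIC_STATS":"true"}.
        accurate = json.loads(flag).get("BASIC_STATS") == "true"
    except (ValueError, AttributeError):
        # Older Hive: a plain "true"/"false" string.
        accurate = flag.lower() == "true"
    if not accurate:
        return None, None
    total_size = properties.get("totalSize")   # StatsSetupConst.TOTAL_SIZE
    row_count = properties.get("numRows")      # StatsSetupConst.ROW_COUNT
    return (int(total_size) if total_size else None,
            int(row_count) if row_count else None)
```

This is only a sketch of the idea under discussion, not what `HiveClientImpl` does today; the thread's open question is whether the flag is actually cleared in the reporter's scenario.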
[GitHub] spark issue #19833: [SPARK-22605][SQL] SQL write job should also set Spark t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19833 **[Test build #84276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84276/testReport)** for PR 19833 at commit [`be1c62e`](https://github.com/apache/spark/commit/be1c62e5b16a8b4a7bb177be5a5042c9006d2903). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19821: [WIP][SPARK-22608][SQL] add new API to CodeGeneration.sp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19821 @kiszk Can you fix the conflict? Now we can add an intermediate version: ``` def splitExpressions( expressions: Seq[String], funcName: String, extraArguments: Seq[(String, String)]) ```
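The proposed `splitExpressions` groups generated statements into helper functions of bounded size (to stay under the JVM's 64 KB method limit) and emits calls to them, threading `extraArguments` through each helper. A loose Python analogue of that code-splitting idea (Spark's `CodeGenerator` emits Java source, and the names below are illustrative only):

```python
# Sketch: split a flat list of statement strings into small helper
# functions plus a call site, passing the same extra arguments to each.

def split_expressions(expressions, func_name, extra_arguments,
                      max_stmts_per_func=2):
    """Return (helper_defs, call_site)."""
    params = ", ".join(name for _type, name in extra_arguments)
    helpers, calls = [], []
    for i in range(0, len(expressions), max_stmts_per_func):
        chunk = expressions[i:i + max_stmts_per_func]
        name = f"{func_name}_{i // max_stmts_per_func}"
        body = "\n".join(f"    {stmt}" for stmt in chunk)
        helpers.append(f"def {name}({params}):\n{body}")
        calls.append(f"{name}({params})")
    return helpers, "\n".join(calls)

helpers, call_site = split_expressions(
    ["acc[0] += x", "acc[0] += y", "acc[0] += z"],
    "apply", [("list", "acc"), ("int", "x"), ("int", "y"), ("int", "z")])
# Three statements with a limit of 2 per function -> two helpers,
# apply_0 and apply_1, each taking (acc, x, y, z).
```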
[GitHub] spark pull request #19838: [SPARK-22638][SS]Use a separate query for Streami...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19838#discussion_r153674694 --- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala --- @@ -87,7 +87,9 @@ private[spark] class LiveListenerBus(conf: SparkConf) { * of each other (each one uses a separate thread for delivering events), allowing slower * listeners to be somewhat isolated from others. */ - private def addToQueue(listener: SparkListenerInterface, queue: String): Unit = synchronized { + private[spark] def addToQueue( --- End diff -- Is it necessary to make this change?
[GitHub] spark pull request #19717: [SPARK-18278] [Submission] Spark on Kubernetes - ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r15367 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/Client.scala --- @@ -0,0 +1,236 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit + +import java.util.{Collections, UUID} + +import scala.collection.JavaConverters._ +import scala.collection.mutable +import scala.util.control.NonFatal + +import io.fabric8.kubernetes.api.model._ +import io.fabric8.kubernetes.client.KubernetesClient + +import org.apache.spark.SparkConf +import org.apache.spark.deploy.SparkApplication +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.KubernetesClientFactory +import org.apache.spark.deploy.k8s.submit.steps.DriverConfigurationStep +import org.apache.spark.internal.Logging +import org.apache.spark.util.Utils + +/** + * Encapsulates arguments to the submission client. 
+ * + * @param mainAppResource the main application resource + * @param mainClass the main class of the application to run + * @param driverArgs arguments to the driver + */ +private[spark] case class ClientArguments( + mainAppResource: MainAppResource, + mainClass: String, + driverArgs: Array[String]) + +private[spark] object ClientArguments { + + def fromCommandLineArgs(args: Array[String]): ClientArguments = { +var mainAppResource: Option[MainAppResource] = None +var mainClass: Option[String] = None +val driverArgs = mutable.Buffer.empty[String] + +args.sliding(2, 2).toList.collect { + case Array("--primary-java-resource", primaryJavaResource: String) => +mainAppResource = Some(JavaMainAppResource(primaryJavaResource)) + case Array("--main-class", clazz: String) => +mainClass = Some(clazz) + case Array("--arg", arg: String) => +driverArgs += arg + case other => +val invalid = other.mkString(" ") +throw new RuntimeException(s"Unknown arguments: $invalid") +} + +require(mainAppResource.isDefined, + "Main app resource must be defined by --primary-java-resource.") +require(mainClass.isDefined, "Main class must be specified via --main-class") + +ClientArguments( + mainAppResource.get, + mainClass.get, + driverArgs.toArray) + } +} + +/** + * Submits a Spark application to run on Kubernetes by creating the driver pod and starting a + * watcher that monitors and logs the application status. Waits for the application to terminate if + * spark.kubernetes.submission.waitAppCompletion is true. 
+ * + * @param submissionSteps steps that collectively configure the driver + * @param submissionSparkConf the submission client Spark configuration + * @param kubernetesClient the client to talk to the Kubernetes API server + * @param waitForAppCompletion a flag indicating whether the client should wait for the application + * to complete + * @param appName the application name + * @param loggingPodStatusWatcher a watcher that monitors and logs the application status + */ +private[spark] class Client( +submissionSteps: Seq[DriverConfigurationStep], +submissionSparkConf: SparkConf, +kubernetesClient: KubernetesClient, +waitForAppCompletion: Boolean, +appName: String, +loggingPodStatusWatcher: LoggingPodStatusWatcher) extends Logging { + + private val driverJavaOptions = submissionSparkConf.get( +org.apache.spark.internal.config.DRIVER_JAVA_OPTIONS) + + /** +* Run command that initializes a DriverSpec that will be updated after each +* DriverConfigurationStep in the sequence that is passed in. The final
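The `ClientArguments.fromCommandLineArgs` parser in the quoted diff walks the argument array in flag/value pairs via `args.sliding(2, 2)`. The same pairwise scheme in plain Python, with a dict standing in for the `ClientArguments` case class (a sketch that, for simplicity, assumes an even number of arguments):

```python
# Pairwise CLI parsing mirroring the Scala args.sliding(2, 2) pattern.

def parse_client_args(args):
    main_resource, main_class, driver_args = None, None, []
    # zip the even-indexed flags with the odd-indexed values
    for flag, value in zip(args[::2], args[1::2]):
        if flag == "--primary-java-resource":
            main_resource = value
        elif flag == "--main-class":
            main_class = value
        elif flag == "--arg":
            driver_args.append(value)   # --arg may repeat; collect in order
        else:
            raise ValueError(f"Unknown arguments: {flag} {value}")
    if main_resource is None:
        raise ValueError("Main app resource must be defined by --primary-java-resource.")
    if main_class is None:
        raise ValueError("Main class must be specified via --main-class")
    return {"mainAppResource": main_resource,
            "mainClass": main_class,
            "driverArgs": driver_args}
```

Usage, mirroring how the submission client is invoked: `parse_client_args(["--primary-java-resource", "local:///opt/app.jar", "--main-class", "org.example.Main", "--arg", "a"])`.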
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test FAILed.
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84275/ Test FAILed.
[GitHub] spark issue #19813: [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84275 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84275/testReport)** for PR 19813 at commit [`8c7f749`](https://github.com/apache/spark/commit/8c7f7496e610fdf4b512c57efd108ccf0238b126). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19838 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84274/ Test PASSed.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19838 Merged build finished. Test PASSed.
[GitHub] spark issue #19838: [SPARK-22638][SS]Use a separate query for StreamingQuery...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19838 **[Test build #84274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84274/testReport)** for PR 19838 at commit [`60035fa`](https://github.com/apache/spark/commit/60035fa865fd85cc7e9441d2dc55d46693b16dee). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19468 LGTM, thanks for the awesome work!
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r153665547 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala --- @@ -67,6 +68,205 @@ class JoinEstimationSuite extends StatsEstimationTestBase { rowCount = 2, attributeStats = AttributeMap(Seq("key-1-2", "key-2-3").map(nameToColInfo))) + private def estimateByHistogram( + histogram1: Histogram, + histogram2: Histogram, + expectedMin: Double, + expectedMax: Double, + expectedNdv: Long, + expectedRows: Long): Unit = { +val col1 = attr("key1") +val col2 = attr("key2") +val c1 = generateJoinChild(col1, histogram1, expectedMin, expectedMax) +val c2 = generateJoinChild(col2, histogram2, expectedMin, expectedMax) + +val c1JoinC2 = Join(c1, c2, Inner, Some(EqualTo(col1, col2))) +val c2JoinC1 = Join(c2, c1, Inner, Some(EqualTo(col2, col1))) +val expectedStatsAfterJoin = Statistics( + sizeInBytes = expectedRows * (8 + 2 * 4), + rowCount = Some(expectedRows), + attributeStats = AttributeMap(Seq( +col1 -> c1.stats.attributeStats(col1).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax)), +col2 -> c2.stats.attributeStats(col2).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax +) + +// Join order should not affect estimation result. 
+Seq(c1JoinC2, c2JoinC1).foreach { join => + assert(join.stats == expectedStatsAfterJoin) +} + } + + private def generateJoinChild( + col: Attribute, + histogram: Histogram, + expectedMin: Double, + expectedMax: Double): LogicalPlan = { +val colStat = inferColumnStat(histogram) +val t = StatsTestPlan( + outputList = Seq(col), + rowCount = (histogram.height * histogram.bins.length).toLong, + attributeStats = AttributeMap(Seq(col -> colStat))) + +val filterCondition = new ArrayBuffer[Expression]() +if (expectedMin > colStat.min.get.toString.toDouble) { + filterCondition += GreaterThanOrEqual(col, Literal(expectedMin)) +} +if (expectedMax < colStat.max.get.toString.toDouble) { + filterCondition += LessThanOrEqual(col, Literal(expectedMax)) +} +if (filterCondition.isEmpty) t else Filter(filterCondition.reduce(And), t) + } + + private def inferColumnStat(histogram: Histogram): ColumnStat = { +var ndv = 0L +for (i <- histogram.bins.indices) { + val bin = histogram.bins(i) + if (i == 0 || bin.hi != histogram.bins(i - 1).hi) { +ndv += bin.ndv + } +} +ColumnStat(distinctCount = ndv, min = Some(histogram.bins.head.lo), + max = Some(histogram.bins.last.hi), nullCount = 0, avgLen = 4, maxLen = 4, + histogram = Some(histogram)) + } + + test("equi-height histograms: a bin is contained by another one") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 10, hi = 30, ndv = 10), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( + HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40))) +// test bin trimming +val (t1, h1) = trimBin(histogram2.bins(0), height = 100, min = 10, max = 60) +assert(t1 == HistogramBin(lo = 10, hi = 50, ndv = 40) && h1 == 80) +val (t2, h2) = trimBin(histogram2.bins(1), height = 100, min = 10, max = 60) +assert(t2 == HistogramBin(lo = 50, hi = 60, ndv = 8) && h2 == 20) + +val expectedRanges = Seq( + OverlappedRange(10, 30, math.min(10, 40*1/2), math.max(10, 40*1/2), 300, 
80*1/2), + OverlappedRange(30, 50, math.min(30*2/3, 40*1/2), math.max(30*2/3, 40*1/2), 300*2/3, 80*1/2), + OverlappedRange(50, 60, math.min(30*1/3, 8), math.max(30*1/3, 8), 300*1/3, 20) +) +assert(expectedRanges.equals( + getOverlappedRanges(histogram1, histogram2, newMin = 10D, newMax = 60D))) + +estimateByHistogram( + histogram1 = histogram1, + histogram2 = histogram2, + expectedMin = 10D, + expectedMax = 60D, + // 10 + 20 + 8 + expectedNdv = 38L, + // 300*40/20 + 200*40/20 + 100*20/10 + expectedRows = 1200L) + } + + test("equi-height histograms: a bin has only one value") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 30, hi = 30, ndv = 1), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( +
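The `trimBin` assertions quoted in the test above follow from a simple rule: clip a bin to the join's `[min, max]` overlap and scale its `ndv` and height by the kept fraction, assuming values spread uniformly inside the bin. A pure-Python sketch reproducing the test's numbers (`trim_bin` here is an illustrative stand-in for the `trimBin` under review, not Spark's code):

```python
from collections import namedtuple

HistogramBin = namedtuple("HistogramBin", ["lo", "hi", "ndv"])

def trim_bin(b, height, lo, hi):
    """Clip bin `b` to [lo, hi]; return (trimmed_bin, trimmed_height)."""
    new_lo, new_hi = max(b.lo, lo), min(b.hi, hi)
    if b.hi == b.lo:
        # Single-value bin: it is either entirely kept or entirely dropped.
        ratio = 1.0 if lo <= b.lo <= hi else 0.0
    else:
        # Uniformity assumption: ndv and height shrink with the kept width.
        ratio = (new_hi - new_lo) / (b.hi - b.lo)
    return HistogramBin(new_lo, new_hi, round(b.ndv * ratio)), height * ratio

# Reproduces the assertions in the quoted test (histogram2, height 100,
# trimmed to the join range [10, 60]):
b1, h1 = trim_bin(HistogramBin(0, 50, 50), height=100, lo=10, hi=60)
# b1 == HistogramBin(10, 50, 40), h1 == 80.0
b2, h2 = trim_bin(HistogramBin(50, 100, 40), height=100, lo=10, hi=60)
# b2 == HistogramBin(50, 60, 8), h2 == 20.0
```

The trimmed bins then feed the overlapped-range computation, which pairs up overlapping bin segments from the two histograms to estimate the join's NDV and row count.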
[GitHub] spark pull request #19594: [SPARK-21984] [SQL] Join estimation based on equi...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19594#discussion_r153665092 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala --- @@ -67,6 +68,205 @@ class JoinEstimationSuite extends StatsEstimationTestBase { rowCount = 2, attributeStats = AttributeMap(Seq("key-1-2", "key-2-3").map(nameToColInfo))) + private def estimateByHistogram( + histogram1: Histogram, + histogram2: Histogram, + expectedMin: Double, + expectedMax: Double, + expectedNdv: Long, + expectedRows: Long): Unit = { +val col1 = attr("key1") +val col2 = attr("key2") +val c1 = generateJoinChild(col1, histogram1, expectedMin, expectedMax) +val c2 = generateJoinChild(col2, histogram2, expectedMin, expectedMax) + +val c1JoinC2 = Join(c1, c2, Inner, Some(EqualTo(col1, col2))) +val c2JoinC1 = Join(c2, c1, Inner, Some(EqualTo(col2, col1))) +val expectedStatsAfterJoin = Statistics( + sizeInBytes = expectedRows * (8 + 2 * 4), + rowCount = Some(expectedRows), + attributeStats = AttributeMap(Seq( +col1 -> c1.stats.attributeStats(col1).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax)), +col2 -> c2.stats.attributeStats(col2).copy( + distinctCount = expectedNdv, min = Some(expectedMin), max = Some(expectedMax +) + +// Join order should not affect estimation result. 
+Seq(c1JoinC2, c2JoinC1).foreach { join => + assert(join.stats == expectedStatsAfterJoin) +} + } + + private def generateJoinChild( + col: Attribute, + histogram: Histogram, + expectedMin: Double, + expectedMax: Double): LogicalPlan = { +val colStat = inferColumnStat(histogram) +val t = StatsTestPlan( + outputList = Seq(col), + rowCount = (histogram.height * histogram.bins.length).toLong, + attributeStats = AttributeMap(Seq(col -> colStat))) + +val filterCondition = new ArrayBuffer[Expression]() +if (expectedMin > colStat.min.get.toString.toDouble) { + filterCondition += GreaterThanOrEqual(col, Literal(expectedMin)) +} +if (expectedMax < colStat.max.get.toString.toDouble) { + filterCondition += LessThanOrEqual(col, Literal(expectedMax)) +} +if (filterCondition.isEmpty) t else Filter(filterCondition.reduce(And), t) + } + + private def inferColumnStat(histogram: Histogram): ColumnStat = { +var ndv = 0L +for (i <- histogram.bins.indices) { + val bin = histogram.bins(i) + if (i == 0 || bin.hi != histogram.bins(i - 1).hi) { +ndv += bin.ndv + } +} +ColumnStat(distinctCount = ndv, min = Some(histogram.bins.head.lo), + max = Some(histogram.bins.last.hi), nullCount = 0, avgLen = 4, maxLen = 4, + histogram = Some(histogram)) + } + + test("equi-height histograms: a bin is contained by another one") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 10, hi = 30, ndv = 10), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( + HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40))) +// test bin trimming +val (t1, h1) = trimBin(histogram2.bins(0), height = 100, min = 10, max = 60) +assert(t1 == HistogramBin(lo = 10, hi = 50, ndv = 40) && h1 == 80) +val (t2, h2) = trimBin(histogram2.bins(1), height = 100, min = 10, max = 60) +assert(t2 == HistogramBin(lo = 50, hi = 60, ndv = 8) && h2 == 20) + +val expectedRanges = Seq( + OverlappedRange(10, 30, math.min(10, 40*1/2), math.max(10, 40*1/2), 300, 
80*1/2), + OverlappedRange(30, 50, math.min(30*2/3, 40*1/2), math.max(30*2/3, 40*1/2), 300*2/3, 80*1/2), + OverlappedRange(50, 60, math.min(30*1/3, 8), math.max(30*1/3, 8), 300*1/3, 20) +) +assert(expectedRanges.equals( + getOverlappedRanges(histogram1, histogram2, newMin = 10D, newMax = 60D))) + +estimateByHistogram( + histogram1 = histogram1, + histogram2 = histogram2, + expectedMin = 10D, + expectedMax = 60D, + // 10 + 20 + 8 + expectedNdv = 38L, + // 300*40/20 + 200*40/20 + 100*20/10 + expectedRows = 1200L) + } + + test("equi-height histograms: a bin has only one value") { +val histogram1 = Histogram(height = 300, Array( + HistogramBin(lo = 30, hi = 30, ndv = 1), HistogramBin(lo = 30, hi = 60, ndv = 30))) +val histogram2 = Histogram(height = 100, Array( +
[GitHub] spark issue #19837: [SPARK-22637][SQL] Only refresh a logical plan once.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19837 Merged build finished. Test PASSed.