[GitHub] spark pull request #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATT...

2017-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16422#discussion_r126275554
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out ---
@@ -0,0 +1,133 @@
+-- Automatically generated by SQLQueryTestSuite
+-- Number of queries: 15
+
+
+-- !query 0
+CREATE TEMPORARY VIEW desc_col_temp_table (key int COMMENT 
'column_comment') USING PARQUET
+-- !query 0 schema
+struct<>
+-- !query 0 output
+
+
+
+-- !query 1
+DESC desc_col_temp_table key
+-- !query 1 schema
+struct
+-- !query 1 output
+col_name   data_type   comment 
+key        int         column_comment
+
+
+-- !query 2
+DESC EXTENDED desc_col_temp_table key
--- End diff --

Since we already treat `DESC EXTENDED` and `DESC FORMATTED` as identical for tables, let us make `DESC EXTENDED` and `DESC FORMATTED` the same for columns too.
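For illustration, here is a minimal sketch (assuming the column-describe support added in this PR, and reusing the table and column from the test file above) of what treating the two keywords identically means for users: both statements return the same column description.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("desc-column-sketch").getOrCreate()
spark.sql("CREATE TEMPORARY VIEW desc_col_temp_table (key int COMMENT 'column_comment') USING PARQUET")

// Under the suggestion above these two would produce identical output.
spark.sql("DESC EXTENDED desc_col_temp_table key").show()
spark.sql("DESC FORMATTED desc_col_temp_table key").show()

spark.stop()
```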





[GitHub] spark pull request #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATT...

2017-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16422#discussion_r126275528
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -320,10 +320,15 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
* Create a [[DescribeTableCommand]] logical plan.
*/
   override def visitDescribeTable(ctx: DescribeTableContext): LogicalPlan 
= withOrigin(ctx) {
-// Describe column are not supported yet. Return null and let the 
parser decide
-// what to do with this (create an exception or pass it on to a 
different system).
 if (ctx.describeColName != null) {
-  null
+  if (ctx.partitionSpec != null) {
+throw new ParseException("DESC TABLE COLUMN for a specific 
partition is not supported", ctx)
+  } else {
+DescribeColumnCommand(
+  visitTableIdentifier(ctx.tableIdentifier),
+  ctx.describeColName.getText,
+  ctx.FORMATTED != null)
--- End diff --

I think it is fine to keep EXTENDED and FORMATTED the same.





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126275501
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -95,7 +93,6 @@ class SparkSessionBuilderSuite extends SparkFunSuite {
   }
 
   test("create SparkContext first then SparkSession") {
-sparkContext.stop()
--- End diff --

nit: the `session.stop` at the end of the test case can also be removed.





[GitHub] spark pull request #18292: [SPARK-21083][SQL] Do not remove correct column s...

2017-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18292#discussion_r126275492
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -40,10 +39,10 @@ case class AnalyzeTableCommand(
 }
 val newTotalSize = CommandUtils.calculateTotalSize(sessionState, 
tableMeta)
 
-val oldTotalSize = 
tableMeta.stats.map(_.sizeInBytes.toLong).getOrElse(0L)
+val oldTotalSize = 
tableMeta.stats.map(_.sizeInBytes.toLong).getOrElse(-1L)
 val oldRowCount = 
tableMeta.stats.flatMap(_.rowCount.map(_.toLong)).getOrElse(-1L)
 var newStats: Option[CatalogStatistics] = None
-if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+if (newTotalSize >= 0 && newTotalSize != oldTotalSize) {
--- End diff --

This is a bug; we should fix it and backport the fix to previous releases.
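For context, a small self-contained sketch (values are made up) of the case the old sentinel missed: with `getOrElse(0L)` and a `newTotalSize > 0` check, a freshly computed size of 0 for an empty table never updates the stored stats, and "no previous stats" cannot be told apart from a real size of 0.

```
// Hypothetical values, only to illustrate the sentinel change in the diff above.
val previousSizeInCatalog: Option[Long] = None   // table has never been analyzed
val newTotalSize: Long = 0L                      // freshly computed size of an empty table

val oldTotalSize = previousSizeInCatalog.getOrElse(-1L)  // -1 means "no previous stats"
if (newTotalSize >= 0 && newTotalSize != oldTotalSize) {
  println("stats updated")  // with getOrElse(0L) and `newTotalSize > 0` this branch was skipped
}
```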





[GitHub] spark pull request #18292: [SPARK-21083][SQL] Do not remove correct column s...

2017-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18292#discussion_r126275485
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -42,17 +42,20 @@ case class AnalyzeColumnCommand(
 if (tableMeta.tableType == CatalogTableType.VIEW) {
   throw new AnalysisException("ANALYZE TABLE is not supported on 
views.")
 }
-val sizeInBytes = CommandUtils.calculateTotalSize(sessionState, 
tableMeta)
 
+val newSize = CommandUtils.calculateTotalSize(sessionState, tableMeta)
 // Compute stats for each column
-val (rowCount, newColStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+val (newRowCount, colStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+
+// Because we will invalidate or update stats when table is changed, 
if column stats exist,
+// they should be correct. Hence, we can combine previous column stats 
and the newly collected
+// column stats.
+val oldColStats = tableMeta.stats.map(_.colStats).getOrElse(Map.empty)
+val newColStats = oldColStats ++ colStats
 
 // We also update table-level stats in order to keep them consistent 
with column-level stats.
 val statistics = CatalogStatistics(
-  sizeInBytes = sizeInBytes,
-  rowCount = Some(rowCount),
-  // Newly computed column stats should override the existing ones.
-  colStats = tableMeta.stats.map(_.colStats).getOrElse(Map.empty) ++ 
newColStats)
--- End diff --

Yes. I did not find any semantic difference.
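As a quick reference (hypothetical column names), Scala's `Map ++` keeps entries from the left operand and lets the right operand win on key collisions, so splitting the expression out as above behaves exactly like the previous inline version: newly computed stats still override, untouched columns keep their old stats.

```
val oldColStats = Map("a" -> "old_a", "b" -> "old_b")
val colStats    = Map("b" -> "new_b", "c" -> "new_c")

val newColStats = oldColStats ++ colStats
println(newColStats)  // Map(a -> old_a, b -> new_b, c -> new_c)
```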





[GitHub] spark pull request #18307: [SPARK-21100][SQL] Add summary method as alternat...

2017-07-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18307





[GitHub] spark issue #18307: [SPARK-21100][SQL] Add summary method as alternative to ...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18307
  
thanks, merging to master!





[GitHub] spark issue #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSessionBuilder...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18567
  
**[Test build #79369 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79369/testReport)**
 for PR 18567 at commit 
[`4f4b452`](https://github.com/apache/spark/commit/4f4b4522e76b289b576888789b0b0f26fd14f8ac).





[GitHub] spark issue #18566: [SPARK-21343] Refine the document for spark.reducer.maxR...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18566
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18566: [SPARK-21343] Refine the document for spark.reducer.maxR...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18566
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79358/
Test PASSed.





[GitHub] spark pull request #18566: [SPARK-21343] Refine the document for spark.reduc...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18566#discussion_r126275247
  
--- Diff: docs/configuration.md ---
@@ -529,6 +529,15 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.reducer.maxReqSizeShuffleToMem
+  200m
--- End diff --

This default value is wrong.





[GitHub] spark pull request #18566: [SPARK-21343] Refine the document for spark.reduc...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18566#discussion_r126275245
  
--- Diff: docs/configuration.md ---
@@ -529,6 +529,15 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.reducer.maxReqSizeShuffleToMem
+  200m
+  
+The blocks of a shuffle request will be fetched to disk when size of 
the request is above
+this threshold. This is to avoid a giant request takes too much 
memory. Note that it's
+necessary to have the support of shuffle service(at least Spark-2.2) 
when enable this config.
--- End diff --

Let's also mention how to enable this config, e.g. by setting it to `200m`.
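For example, a user-side sketch of enabling it (the `200m` threshold here is only an example value, not the shipped default):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-to-mem-example")
  .setMaster("local")
  .set("spark.reducer.maxReqSizeShuffleToMem", "200m")
// or on the command line:
//   spark-submit --conf spark.reducer.maxReqSizeShuffleToMem=200m ...
```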





[GitHub] spark issue #18566: [SPARK-21343] Refine the document for spark.reducer.maxR...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18566
  
**[Test build #79358 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79358/testReport)**
 for PR 18566 at commit 
[`1a042dd`](https://github.com/apache/spark/commit/1a042dd5f0448eced1e90be146c9f47a44a5a01a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18566: [SPARK-21343] Refine the document for spark.reduc...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18566#discussion_r126275228
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -323,9 +323,10 @@ package object config {
 
   private[spark] val REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM =
 ConfigBuilder("spark.reducer.maxReqSizeShuffleToMem")
-  .internal()
   .doc("The blocks of a shuffle request will be fetched to disk when 
size of the request is " +
-"above this threshold. This is to avoid a giant request takes too 
much memory.")
+"above this threshold. This is to avoid a giant request takes too 
much memory. Note that" +
+" it's necessary to have the support of shuffle service(at least 
Spark-2.2) when enable " +
--- End diff --

Is it possible to implement this feature without the shuffle service?





[GitHub] spark issue #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSessionBuilder...

2017-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18567
  
Now, each unit test becomes independent.





[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18555
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126275145
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -36,18 +38,22 @@ class SparkSessionBuilderSuite extends SparkFunSuite {
 initialSession.sparkContext
   }
 
+  override def afterEach(): Unit = {
+// This suite should not interfere with the other test suites.
+SparkSession.clearDefaultSession()
+SparkSession.clearActiveSession()
--- End diff --

We need to update the unit test case because SparkContext requires `spark.master`. I'll update it accordingly.
```
if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
```
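A minimal sketch of that kind of fix (names are illustrative): have the builder used by the test always supply a master, so the lazily created SparkContext passes the check quoted above.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")                       // satisfies the spark.master requirement
  .appName("SparkSessionBuilderSuite")   // illustrative app name
  .getOrCreate()
assert(spark.sparkContext.master == "local")
spark.stop()
```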





[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18555
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79361/
Test FAILed.





[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18555
  
**[Test build #79361 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79361/testReport)**
 for PR 18555 at commit 
[`b19b378`](https://github.com/apache/spark/commit/b19b378ed72bfa910e889f23bc4f848324445d08).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16630: [SPARK-19270][ML] Add summary table to GLM summary

2017-07-07 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16630
  
@yanboliang Could you take a look? 





[GitHub] spark issue #18393: [SPARK-20342][core] Update task accumulators before send...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18393
  
LGTM, pending test





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126274957
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -17,13 +17,15 @@
 
 package org.apache.spark.sql
 
+import org.scalatest.BeforeAndAfterEach
+
 import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}
 import org.apache.spark.sql.internal.SQLConf
 
 /**
  * Test cases for the builder pattern of [[SparkSession]].
  */
-class SparkSessionBuilderSuite extends SparkFunSuite {
+class SparkSessionBuilderSuite extends SparkFunSuite with 
BeforeAndAfterEach {
 
   private var initialSession: SparkSession = _
--- End diff --

Ah, yes. The other two references just call `stop`. I will move this.





[GitHub] spark issue #18292: [SPARK-21083][SQL] Do not remove correct column stats wh...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18292
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18292: [SPARK-21083][SQL] Do not remove correct column stats wh...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18292
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79360/
Test PASSed.





[GitHub] spark issue #18292: [SPARK-21083][SQL] Do not remove correct column stats wh...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18292
  
**[Test build #79360 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79360/testReport)**
 for PR 18292 at commit 
[`b41adec`](https://github.com/apache/spark/commit/b41adeca65d05bb8de192969bcd0f28a0b4ee750).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18393: [SPARK-20342][core] Update task accumulators before send...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18393
  
**[Test build #79368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79368/testReport)**
 for PR 18393 at commit 
[`b61fc0c`](https://github.com/apache/spark/commit/b61fc0c6d5e1db46f2df89ba43458bf0336ab90e).





[GitHub] spark issue #18393: [SPARK-20342][core] Update task accumulators before send...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18393
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18393: [SPARK-20342][core] Update task accumulators before send...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18393
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79357/
Test PASSed.





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126274862
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -36,18 +38,22 @@ class SparkSessionBuilderSuite extends SparkFunSuite {
 initialSession.sparkContext
   }
 
+  override def afterEach(): Unit = {
+// This suite should not interfere with the other test suites.
+SparkSession.clearDefaultSession()
+SparkSession.clearActiveSession()
--- End diff --

Sure!





[GitHub] spark issue #18393: [SPARK-20342][core] Update task accumulators before send...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18393
  
**[Test build #79357 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79357/testReport)**
 for PR 18393 at commit 
[`909c375`](https://github.com/apache/spark/commit/909c375aede8501d81da00e65187ff444d652665).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126274860
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -17,13 +17,15 @@
 
 package org.apache.spark.sql
 
+import org.scalatest.BeforeAndAfterEach
+
 import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}
 import org.apache.spark.sql.internal.SQLConf
 
 /**
  * Test cases for the builder pattern of [[SparkSession]].
  */
-class SparkSessionBuilderSuite extends SparkFunSuite {
+class SparkSessionBuilderSuite extends SparkFunSuite with 
BeforeAndAfterEach {
 
   private var initialSession: SparkSession = _
--- End diff --

This is also used by `sparkContext`, and `sparkContext` is used in 3 test cases.
```
private lazy val sparkContext: SparkContext = {
  initialSession = SparkSession.builder()
```





[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #79367 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79367/testReport)**
 for PR 18468 at commit 
[`2c9e63e`](https://github.com/apache/spark/commit/2c9e63eb88e0e169132c994a575e18141fb99037).





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126274671
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -17,13 +17,15 @@
 
 package org.apache.spark.sql
 
+import org.scalatest.BeforeAndAfterEach
+
 import org.apache.spark.{SparkConf, SparkContext, SparkFunSuite}
 import org.apache.spark.sql.internal.SQLConf
 
 /**
  * Test cases for the builder pattern of [[SparkSession]].
  */
-class SparkSessionBuilderSuite extends SparkFunSuite {
+class SparkSessionBuilderSuite extends SparkFunSuite with 
BeforeAndAfterEach {
 
   private var initialSession: SparkSession = _
--- End diff --

Seems this is only used in the first test case? Shall we just move it to the first test case?





[GitHub] spark pull request #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSession...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18567#discussion_r126274672
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -36,18 +38,22 @@ class SparkSessionBuilderSuite extends SparkFunSuite {
 initialSession.sparkContext
   }
 
+  override def afterEach(): Unit = {
+// This suite should not interfere with the other test suites.
+SparkSession.clearDefaultSession()
+SparkSession.clearActiveSession()
--- End diff --

Shall we also stop the session?
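A hedged sketch (not the PR's final code; the suite and test names are illustrative) of a teardown that does both: stop whatever session the test created, then clear the default and active references.

```
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class SessionCleanupSketch extends FunSuite with BeforeAndAfterEach {
  private var session: SparkSession = _

  override def afterEach(): Unit = {
    if (session != null) session.stop()  // also stops the underlying SparkContext
    SparkSession.clearDefaultSession()
    SparkSession.clearActiveSession()
  }

  test("builder creates a usable session") {
    session = SparkSession.builder().master("local").appName("sketch").getOrCreate()
    assert(session.range(3).count() == 3)
  }
}
```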





[GitHub] spark issue #18568: [Revert][SPARK-21307][SQL] Remove SQLConf parameters fro...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18568
  
Can we fix the active session problem? E.g. set the active session in `SparkSession.sql`.
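A user-side sketch of the same idea (the actual proposal is to do this inside `SparkSession.sql` itself, so callers would not need to): pin the active session before issuing SQL, so code that later consults the active session sees the one the query came from.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("active-session-sketch").getOrCreate()
SparkSession.setActiveSession(spark)  // what SparkSession.sql would do internally under the proposal
spark.sql("SELECT 1 AS id").show()
spark.stop()
```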





[GitHub] spark issue #18562: [SPARK-21069][SS][DOCS] Add rate source to programming g...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18562
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79365/
Test PASSed.





[GitHub] spark issue #18562: [SPARK-21069][SS][DOCS] Add rate source to programming g...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18562
  
**[Test build #79365 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79365/testReport)**
 for PR 18562 at commit 
[`c33943d`](https://github.com/apache/spark/commit/c33943d61f3f1a0209b9850c350de06c63344ca1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18562: [SPARK-21069][SS][DOCS] Add rate source to programming g...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18562
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #18292: [SPARK-21083][SQL] Do not remove correct column s...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18292#discussion_r126274515
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -42,17 +42,20 @@ case class AnalyzeColumnCommand(
 if (tableMeta.tableType == CatalogTableType.VIEW) {
   throw new AnalysisException("ANALYZE TABLE is not supported on 
views.")
 }
-val sizeInBytes = CommandUtils.calculateTotalSize(sessionState, 
tableMeta)
 
+val newSize = CommandUtils.calculateTotalSize(sessionState, tableMeta)
 // Compute stats for each column
-val (rowCount, newColStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+val (newRowCount, colStats) = computeColumnStats(sparkSession, 
tableIdentWithDB, columnNames)
+
+// Because we will invalidate or update stats when table is changed, 
if column stats exist,
+// they should be correct. Hence, we can combine previous column stats 
and the newly collected
+// column stats.
+val oldColStats = tableMeta.stats.map(_.colStats).getOrElse(Map.empty)
+val newColStats = oldColStats ++ colStats
 
 // We also update table-level stats in order to keep them consistent 
with column-level stats.
 val statistics = CatalogStatistics(
-  sizeInBytes = sizeInBytes,
-  rowCount = Some(rowCount),
-  // Newly computed column stats should override the existing ones.
-  colStats = tableMeta.stats.map(_.colStats).getOrElse(Map.empty) ++ 
newColStats)
--- End diff --

Seems we already did it?





[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79366/
Test FAILed.





[GitHub] spark pull request #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATT...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16422#discussion_r126274465
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -626,6 +624,117 @@ case class DescribeTableCommand(
   }
 }
 
+/**
+ * A command to list the info for a column, including name, data type, 
column stats and comment.
+ * This function creates a [[DescribeColumnCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   DESCRIBE [EXTENDED|FORMATTED] table_name column_name;
+ * }}}
+ */
+case class DescribeColumnCommand(
+table: TableIdentifier,
+colNameParts: Seq[String],
+isFormatted: Boolean)
+  extends RunnableCommand {
+
+  override val output: Seq[Attribute] = {
+// The displayed names are based on Hive.
+// (Link for the corresponding Hive Jira: 
https://issues.apache.org/jira/browse/HIVE-7050)
+if (isFormatted) {
+  Seq(
+AttributeReference("col_name", StringType, nullable = false,
+  new MetadataBuilder().putString("comment", "name of the 
column").build())(),
+AttributeReference("data_type", StringType, nullable = false,
+  new MetadataBuilder().putString("comment", "data type of the 
column").build())(),
+AttributeReference("min", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "min value of the 
column").build())(),
+AttributeReference("max", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "max value of the 
column").build())(),
+AttributeReference("num_nulls", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "number of nulls of 
the column").build())(),
+AttributeReference("distinct_count", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "distinct count of 
the column").build())(),
+AttributeReference("avg_col_len", StringType, nullable = true,
+  new MetadataBuilder().putString("comment",
+"average length of the values of the column").build())(),
+AttributeReference("max_col_len", StringType, nullable = true,
+  new MetadataBuilder().putString("comment",
+"maximum length of the values of the column").build())(),
+AttributeReference("comment", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "comment of the 
column").build())())
+} else {
+  Seq(
+AttributeReference("col_name", StringType, nullable = false,
+  new MetadataBuilder().putString("comment", "name of the 
column").build())(),
+AttributeReference("data_type", StringType, nullable = false,
+  new MetadataBuilder().putString("comment", "data type of the 
column").build())(),
+AttributeReference("comment", StringType, nullable = true,
+  new MetadataBuilder().putString("comment", "comment of the 
column").build())())
+}
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val catalog = sparkSession.sessionState.catalog
+val resolver = sparkSession.sessionState.conf.resolver
+val relation = sparkSession.table(table).queryExecution.analyzed
+val attribute = {
+  val field = relation.resolve(
--- End diff --

Instead of changing code in `LogicalPlan`, can't we just do:
```
val field = relation.resolve(colNameParts, resolver).getOrElse {
  throw ...
}
if (!field.isInstanceOf[Attribute]) {
  throw ...
}
```
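Spelled out a bit further, a hedged completion of that snippet as it might sit inside `DescribeColumnCommand.run` (the helper name and the exception messages are illustrative, not the PR's actual wording):

```
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Resolve the column with the existing LogicalPlan.resolve and reject nested fields,
// instead of adding new code to LogicalPlan itself.
def resolveColumn(relation: LogicalPlan, colNameParts: Seq[String], resolver: Resolver): Attribute = {
  val field = relation.resolve(colNameParts, resolver).getOrElse {
    throw new AnalysisException(s"Column ${colNameParts.mkString(".")} does not exist")
  }
  field match {
    case attr: Attribute => attr
    case _ => throw new AnalysisException("DESC TABLE COLUMN is not supported for nested columns")
  }
}
```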





[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #79366 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79366/testReport)**
 for PR 18468 at commit 
[`101e4b7`](https://github.com/apache/spark/commit/101e4b79b0e341d28b2fd97c97764c35f9076db6).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATT...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16422#discussion_r126274424
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -320,10 +320,15 @@ class SparkSqlAstBuilder(conf: SQLConf) extends 
AstBuilder(conf) {
* Create a [[DescribeTableCommand]] logical plan.
*/
   override def visitDescribeTable(ctx: DescribeTableContext): LogicalPlan 
= withOrigin(ctx) {
-// Describe column are not supported yet. Return null and let the 
parser decide
-// what to do with this (create an exception or pass it on to a 
different system).
 if (ctx.describeColName != null) {
-  null
+  if (ctx.partitionSpec != null) {
+throw new ParseException("DESC TABLE COLUMN for a specific 
partition is not supported", ctx)
+  } else {
+DescribeColumnCommand(
+  visitTableIdentifier(ctx.tableIdentifier),
+  ctx.describeColName.nameParts.asScala.map(_.getText),
+  ctx.FORMATTED != null)
--- End diff --

Can we use `ctx.EXTENDED != null || ctx.FORMATTED != null`? I think this is more reasonable, although it differs from Hive.
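For concreteness, a sketch of how the branch quoted above would read with that change; everything except the last argument is taken from the diff, and the fragment belongs inside `visitDescribeTable` (the `asScala` call relies on the same `JavaConverters` import the diff already uses).

```
if (ctx.describeColName != null) {
  if (ctx.partitionSpec != null) {
    throw new ParseException("DESC TABLE COLUMN for a specific partition is not supported", ctx)
  } else {
    DescribeColumnCommand(
      visitTableIdentifier(ctx.tableIdentifier),
      ctx.describeColName.nameParts.asScala.map(_.getText),
      // EXTENDED and FORMATTED both switch on the extra column statistics.
      ctx.EXTENDED != null || ctx.FORMATTED != null)
  }
}
```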





[GitHub] spark issue #18562: [SPARK-21069][SS][DOCS] Add rate source to programming g...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18562
  
**[Test build #79365 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79365/testReport)**
 for PR 18562 at commit 
[`c33943d`](https://github.com/apache/spark/commit/c33943d61f3f1a0209b9850c350de06c63344ca1).





[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #79366 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79366/testReport)**
 for PR 18468 at commit 
[`101e4b7`](https://github.com/apache/spark/commit/101e4b79b0e341d28b2fd97c97764c35f9076db6).





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274206
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(1)
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_NODE =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerNode")
+  .doc("For a given task, how many times it can be " +
+"retried on one node, before the entire node is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC =
 ConfigBuilder("spark.blacklist.application.maxFailedTasksPerExecutor")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC_STAGE =
 ConfigBuilder("spark.blacklist.stage.maxFailedTasksPerExecutor")
+  .doc("How many different tasks must fail on one executor, " +
+"within one stage, before the executor is blacklisted for that 
stage.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILED_EXEC_PER_NODE =
 ConfigBuilder("spark.blacklist.application.maxFailedExecutorsPerNode")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILED_EXEC_PER_NODE_STAGE =
 ConfigBuilder("spark.blacklist.stage.maxFailedExecutorsPerNode")
+  .doc("How many different executors are marked as blacklisted for a 
given stage, " +
+"before the entire node is marked as failed for the stage.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274201
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(1)
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_NODE =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerNode")
+  .doc("For a given task, how many times it can be " +
+"retried on one node, before the entire node is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC =
 ConfigBuilder("spark.blacklist.application.maxFailedTasksPerExecutor")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274200
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(1)
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_NODE =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerNode")
+  .doc("For a given task, how many times it can be " +
+"retried on one node, before the entire node is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274204
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(1)
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_NODE =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerNode")
+  .doc("For a given task, how many times it can be " +
+"retried on one node, before the entire node is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC =
 ConfigBuilder("spark.blacklist.application.maxFailedTasksPerExecutor")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC_STAGE =
 ConfigBuilder("spark.blacklist.stage.maxFailedTasksPerExecutor")
+  .doc("How many different tasks must fail on one executor, " +
+"within one stage, before the executor is blacklisted for that 
stage.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILED_EXEC_PER_NODE =
 ConfigBuilder("spark.blacklist.application.maxFailedExecutorsPerNode")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274198
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

`should be greater than 0`
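
For reference, a minimal sketch of how this entry would read with the suggested wording, reusing the `ConfigBuilder` pattern already shown in the diff above (illustrative only; it assumes the surrounding `package object config`, and is not the final PR code):

```scala
// Sketch only: same entry as in the diff, with the message matching the `_ > 0` predicate.
private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
  ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
    .doc("For a given task, how many times it can be retried on one executor " +
      "before the executor is blacklisted for that task.")
    .intConf
    .checkValue(_ > 0, "The value should be greater than 0")
    .createWithDefault(1)
```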





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274203
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -88,50 +137,76 @@ package object config {
 .createOptional
 
   private[spark] val PY_FILES = ConfigBuilder("spark.submit.pyFiles")
+.doc("Comma-separated list of .zip, .egg, or .py files to place on " +
+  "the PYTHONPATH for Python apps.")
 .internal()
 .stringConf
 .toSequence
 .createWithDefault(Nil)
 
   private[spark] val MAX_TASK_FAILURES =
 ConfigBuilder("spark.task.maxFailures")
+  .doc("Number of failures of any particular task before giving up on 
the job. " +
+"The total number of failures spread across different tasks " +
+"will not cause the job to fail; " +
+"a particular task has to fail this number of attempts. " +
+"Should be greater than or equal to 1. Number of allowed retries = 
this value - 1.")
   .intConf
+  .checkValue(_ > 0, "The retry times of task should be greater than 
or equal to 1")
   .createWithDefault(4)
 
   // Blacklist confs
   private[spark] val BLACKLIST_ENABLED =
 ConfigBuilder("spark.blacklist.enabled")
+  .doc("If set to 'true', prevent Spark from scheduling tasks on 
executors " +
+"that have been blacklisted due to too many task failures. " +
+"The blacklisting algorithm can be further controlled by " +
+"the other 'spark.blacklist' configuration options. ")
   .booleanConf
   .createOptional
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_EXECUTOR =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerExecutor")
+  .doc("For a given task, how many times it can be " +
+"retried on one executor before the executor is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(1)
 
   private[spark] val MAX_TASK_ATTEMPTS_PER_NODE =
 ConfigBuilder("spark.blacklist.task.maxTaskAttemptsPerNode")
+  .doc("For a given task, how many times it can be " +
+"retried on one node, before the entire node is blacklisted for 
that task.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC =
 ConfigBuilder("spark.blacklist.application.maxFailedTasksPerExecutor")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
   .createWithDefault(2)
 
   private[spark] val MAX_FAILURES_PER_EXEC_STAGE =
 ConfigBuilder("spark.blacklist.stage.maxFailedTasksPerExecutor")
+  .doc("How many different tasks must fail on one executor, " +
+"within one stage, before the executor is blacklisted for that 
stage.")
   .intConf
+  .checkValue(_ > 0, "The value should be greater than or equal to 1")
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274183
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -51,29 +63,66 @@ package object config {
 
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val EXECUTOR_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.executor.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.executor.userClassPathFirst")
+  .doc("Same functionality as spark.driver.userClassPathFirst, " +
+"but applied to executor instances.")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val EXECUTOR_MEMORY = 
ConfigBuilder("spark.executor.memory")
+.doc("Amount of memory to use per executor process (e.g. 2g, 8g).")
 .bytesConf(ByteUnit.MiB)
+.checkValue(v => v > 0 && v <= Int.MaxValue / (1024 * 1024),
+  "The executor memory must be greater than 0M " +
+  s"and less than ${Int.MaxValue / (1024 * 1024)}M.")
 .createWithDefaultString("1g")
 
-  private[spark] val IS_PYTHON_APP = 
ConfigBuilder("spark.yarn.isPython").internal()
-.booleanConf.createWithDefault(false)
+  private[spark] val IS_PYTHON_APP = ConfigBuilder("spark.yarn.isPython")
+.internal()
+.booleanConf
+.checkValues(Set(false, true))
+.createWithDefault(false)
 
-  private[spark] val CPUS_PER_TASK = 
ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
+  private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
+.doc("Number of cores to allocate for each task. ")
+.intConf
+.checkValue(v => v > 0,
+  "Number of cores to allocate for task event queue must not be 
negative")
+.createWithDefault(1)
 
   private[spark] val DYN_ALLOCATION_MIN_EXECUTORS =
-
ConfigBuilder("spark.dynamicAllocation.minExecutors").intConf.createWithDefault(0)
+ConfigBuilder("spark.dynamicAllocation.minExecutors")
+  .doc("Lower bound for the number of executors if dynamic allocation 
is enabled.")
+  .intConf
+  .checkValue(v => v >= 0,
+"Lower bound for the number of executors should be greater than or 
equal to 0")
+  .createWithDefault(0)
 
   private[spark] val DYN_ALLOCATION_INITIAL_EXECUTORS =
 ConfigBuilder("spark.dynamicAllocation.initialExecutors")
+  .doc("Initial number of executors to run if dynamic allocation is 
enabled. " +
+"If `--num-executors` (or `spark.executor.instances`) is set " +
+"and larger than this value, " +
+"it will be used as the initial number of executors.")
   .fallbackConf(DYN_ALLOCATION_MIN_EXECUTORS)
--- End diff --

check if >= 0?





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274186
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -51,29 +63,66 @@ package object config {
 
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val EXECUTOR_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.executor.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.executor.userClassPathFirst")
+  .doc("Same functionality as spark.driver.userClassPathFirst, " +
+"but applied to executor instances.")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val EXECUTOR_MEMORY = 
ConfigBuilder("spark.executor.memory")
+.doc("Amount of memory to use per executor process (e.g. 2g, 8g).")
 .bytesConf(ByteUnit.MiB)
+.checkValue(v => v > 0 && v <= Int.MaxValue / (1024 * 1024),
+  "The executor memory must be greater than 0M " +
+  s"and less than ${Int.MaxValue / (1024 * 1024)}M.")
 .createWithDefaultString("1g")
 
-  private[spark] val IS_PYTHON_APP = 
ConfigBuilder("spark.yarn.isPython").internal()
-.booleanConf.createWithDefault(false)
+  private[spark] val IS_PYTHON_APP = ConfigBuilder("spark.yarn.isPython")
+.internal()
+.booleanConf
+.checkValues(Set(false, true))
+.createWithDefault(false)
 
-  private[spark] val CPUS_PER_TASK = 
ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
+  private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
+.doc("Number of cores to allocate for each task. ")
+.intConf
+.checkValue(v => v > 0,
+  "Number of cores to allocate for task event queue must not be 
negative")
+.createWithDefault(1)
 
   private[spark] val DYN_ALLOCATION_MIN_EXECUTORS =
-
ConfigBuilder("spark.dynamicAllocation.minExecutors").intConf.createWithDefault(0)
+ConfigBuilder("spark.dynamicAllocation.minExecutors")
+  .doc("Lower bound for the number of executors if dynamic allocation 
is enabled.")
+  .intConf
+  .checkValue(v => v >= 0,
+"Lower bound for the number of executors should be greater than or 
equal to 0")
+  .createWithDefault(0)
 
   private[spark] val DYN_ALLOCATION_INITIAL_EXECUTORS =
 ConfigBuilder("spark.dynamicAllocation.initialExecutors")
+  .doc("Initial number of executors to run if dynamic allocation is 
enabled. " +
+"If `--num-executors` (or `spark.executor.instances`) is set " +
+"and larger than this value, " +
+"it will be used as the initial number of executors.")
   .fallbackConf(DYN_ALLOCATION_MIN_EXECUTORS)
 
   private[spark] val DYN_ALLOCATION_MAX_EXECUTORS =
-
ConfigBuilder("spark.dynamicAllocation.maxExecutors").intConf.createWithDefault(Int.MaxValue)
+ConfigBuilder("spark.dynamicAllocation.maxExecutors")
+  .doc("Upper bound for the number of executors if dynamic allocation 
is enabled. ")
+  .intConf
+  .createWithDefault(Int.MaxValue)
 
   private[spark] val SHUFFLE_SERVICE_ENABLED =
-
ConfigBuilder("spark.shuffle.service.enabled").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.shuffle.service.enabled")
+  .doc("Enables the external shuffle service. This service preserves " 
+
+"the shuffle files written by executors so the executors can be 
safely removed. " +
+"This must be enabled if spark.dynamicAllocation.enabled is 
'true'. " +
+"The external shuffle service must be set up in order to enable 
it. " +
+"See dynamic allocation configuration and setup documentation for 
more information. ")
+  .booleanConf
+  .checkValues(Set(false, true))
--- End diff --

ditto





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126274185
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -51,29 +63,66 @@ package object config {
 
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val EXECUTOR_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.executor.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.executor.userClassPathFirst")
+  .doc("Same functionality as spark.driver.userClassPathFirst, " +
+"but applied to executor instances.")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val EXECUTOR_MEMORY = 
ConfigBuilder("spark.executor.memory")
+.doc("Amount of memory to use per executor process (e.g. 2g, 8g).")
 .bytesConf(ByteUnit.MiB)
+.checkValue(v => v > 0 && v <= Int.MaxValue / (1024 * 1024),
+  "The executor memory must be greater than 0M " +
+  s"and less than ${Int.MaxValue / (1024 * 1024)}M.")
 .createWithDefaultString("1g")
 
-  private[spark] val IS_PYTHON_APP = 
ConfigBuilder("spark.yarn.isPython").internal()
-.booleanConf.createWithDefault(false)
+  private[spark] val IS_PYTHON_APP = ConfigBuilder("spark.yarn.isPython")
+.internal()
+.booleanConf
+.checkValues(Set(false, true))
+.createWithDefault(false)
 
-  private[spark] val CPUS_PER_TASK = 
ConfigBuilder("spark.task.cpus").intConf.createWithDefault(1)
+  private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
+.doc("Number of cores to allocate for each task. ")
+.intConf
+.checkValue(v => v > 0,
+  "Number of cores to allocate for task event queue must not be 
negative")
+.createWithDefault(1)
 
   private[spark] val DYN_ALLOCATION_MIN_EXECUTORS =
-
ConfigBuilder("spark.dynamicAllocation.minExecutors").intConf.createWithDefault(0)
+ConfigBuilder("spark.dynamicAllocation.minExecutors")
+  .doc("Lower bound for the number of executors if dynamic allocation 
is enabled.")
+  .intConf
+  .checkValue(v => v >= 0,
+"Lower bound for the number of executors should be greater than or 
equal to 0")
+  .createWithDefault(0)
 
   private[spark] val DYN_ALLOCATION_INITIAL_EXECUTORS =
 ConfigBuilder("spark.dynamicAllocation.initialExecutors")
+  .doc("Initial number of executors to run if dynamic allocation is 
enabled. " +
+"If `--num-executors` (or `spark.executor.instances`) is set " +
+"and larger than this value, " +
+"it will be used as the initial number of executors.")
   .fallbackConf(DYN_ALLOCATION_MIN_EXECUTORS)
 
   private[spark] val DYN_ALLOCATION_MAX_EXECUTORS =
-
ConfigBuilder("spark.dynamicAllocation.maxExecutors").intConf.createWithDefault(Int.MaxValue)
+ConfigBuilder("spark.dynamicAllocation.maxExecutors")
+  .doc("Upper bound for the number of executors if dynamic allocation 
is enabled. ")
+  .intConf
+  .createWithDefault(Int.MaxValue)
--- End diff --

check if >= 0?
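
For reference, a minimal sketch of the bound check being asked about, following the `checkValue` pattern used for `minExecutors` earlier in this diff (illustrative only; it assumes the surrounding `package object config`):

```scala
// Sketch only: add a non-negativity check before the default is applied.
private[spark] val DYN_ALLOCATION_MAX_EXECUTORS =
  ConfigBuilder("spark.dynamicAllocation.maxExecutors")
    .doc("Upper bound for the number of executors if dynamic allocation is enabled.")
    .intConf
    .checkValue(_ >= 0,
      "Upper bound for the number of executors should be greater than or equal to 0")
    .createWithDefault(Int.MaxValue)
```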





[GitHub] spark issue #18292: [SPARK-21083][SQL] Do not remove correct column stats wh...

2017-07-07 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/18292
  
@cloud-fan @gatorsmile This PR is updated based on the latest code. Would you take another look?





[GitHub] spark issue #16422: [SPARK-17642] [SQL] support DESC EXTENDED/FORMATTED tabl...

2017-07-07 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16422
  
@cloud-fan @gatorsmile I've updated this PR and solved the qualified column name/nested column issue; could you please check again?





[GitHub] spark pull request #18393: [SPARK-20342][core] Update task accumulators befo...

2017-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18393#discussion_r126273879
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2277,6 +2277,29 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
   (Success, 1)))
   }
 
+  test("task end event should have updated accumulators (SPARK-20342)") {
+val accumIds = new HashSet[Long]()
+val listener = new SparkListener() {
+  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
+event.taskInfo.accumulables.foreach { acc => accumIds += acc.id }
+  }
+}
+sc.addSparkListener(listener)
+
+// Try a few times in a loop to make sure. This is not guaranteed to 
fail when the bug exists,
+// but it should at least make the test flaky. If the bug is fixed, 
this should always pass.
+(1 to 10).foreach { _ =>
+  accumIds.clear()
+
+  val accum = sc.longAccumulator
+  sc.parallelize(1 to 10, 10).foreach { _ =>
--- End diff --

ah, I see. I'll tweak the test in a different way.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val partitionSpec =
+  if (ctx.partitionSpec != null) {
+val filteredSpec = 
visitPartitionSpec(ctx.partitionSpec).filter(_._2.isDefined)
+if (filteredSpec.isEmpty) {
+  None
+} else {
+  Some(filteredSpec.mapValues(_.get))
+}
+  } else {
+None
+  }
+
+val table = visitTableIdentifier(ctx.tableIdentifier)
+if (ctx.identifierSeq() == null) {
+  if (partitionSpec.isDefined) {
+AnalyzePartitionCommand(table, partitionSpec.get, noscan = 
ctx.identifier != null)
+  } else {
+AnalyzeTableCommand(table, noscan = ctx.identifier != null)
--- End diff --

@mbasmanova IIUC, the logic is wrong here. For example, when analyzing partition (ds, hr), we should not remove those columns in the parser. Currently we parse it to AnalyzeTableCommand, which collects table-level stats, but what we need to do is collect partition-level stats for all partitions.
Check Hive's behavior here: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-Examples
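
To make the distinction concrete, here is a hypothetical, self-contained sketch (the names below are illustrative and not the PR's actual classes): a PARTITION clause whose columns carry no values should still be treated as a request for partition-level stats over all matching partitions, rather than being filtered away so that the command degrades to a table-level scan.

```scala
object AnalyzeTargetSketch {
  // column name -> optional literal value, as a partition spec is parsed
  type RawSpec = Map[String, Option[String]]

  sealed trait AnalyzeTarget
  case object WholeTable extends AnalyzeTarget
  final case class Partitions(spec: RawSpec) extends AnalyzeTarget

  def target(raw: Option[RawSpec]): AnalyzeTarget = raw match {
    case None       => WholeTable        // no PARTITION clause at all
    case Some(spec) => Partitions(spec)  // keep the spec even when every value is None
  }

  def main(args: Array[String]): Unit = {
    println(target(None))                                   // WholeTable
    println(target(Some(Map("ds" -> None, "hr" -> None))))  // all partitions of (ds, hr)
    println(target(Some(Map("ds" -> Some("2010-01-01")))))  // partitions with ds=2010-01-01
  }
}
```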





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread heary-cao
Github user heary-cao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273756
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -35,10 +35,22 @@ package object config {
 
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.driver.userClassPathFirst")
+  .doc("Whether to give user-added jars precedence over Spark's own 
jars " +
+"when loading classes in the driver. This feature can be used to 
mitigate " +
+"conflicts between Spark's dependencies and user dependencies. " +
+"It is currently an experimental feature. This is used in cluster 
mode only. ")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
+.doc("Amount of memory to use for the driver process, i.e. " +
+  "where SparkContext is initialized. (e.g. 1g, 2g). ")
--- End diff --

OK.





[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread heary-cao
Github user heary-cao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273760
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -35,10 +35,22 @@ package object config {
 
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.driver.userClassPathFirst")
+  .doc("Whether to give user-added jars precedence over Spark's own 
jars " +
+"when loading classes in the driver. This feature can be used to 
mitigate " +
+"conflicts between Spark's dependencies and user dependencies. " +
+"It is currently an experimental feature. This is used in cluster 
mode only. ")
+  .booleanConf
+  .checkValues(Set(false, true))
--- End diff --

Thanks. Please review it again.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273773
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -201,6 +202,193 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS").collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS").collect()
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze single partition noscan") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-01') 
SELECT * FROM src")
+  sql(
+s"""
+   |INSERT INTO TABLE $tableName PARTITION (ds='2010-01-02')
+   |SELECT * FROM src
+   |UNION ALL
+   |SELECT * FROM src
+ """.stripMargin)
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='2010-01-03') 
SELECT * FROM src")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+.collect()
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 5812)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*5812)
+}
+  }
+
+  test("analyze a set of partitions") {
--- End diff --

@mbasmanova IIUC, the logic is wrong here. For example, when analyzing partition (ds, hr), we should not remove those columns in the parser. Currently we parse it to `AnalyzeTableCommand`, which collects table-level stats, but what we need to do is collect **partition-level stats for all partitions**.
Check Hive's behavior [here](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-Examples)





[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18555
  
**[Test build #79364 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79364/testReport)**
 for PR 18555 at commit 
[`6512814`](https://github.com/apache/spark/commit/651281417ee180263e2d95316327691551af10ab).





[GitHub] spark issue #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSessionBuilder...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18567
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79356/
Test PASSed.





[GitHub] spark issue #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSessionBuilder...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18567
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18567: [SPARK-21345][SQL][TEST][test-maven] SparkSessionBuilder...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18567
  
**[Test build #79356 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79356/testReport)**
 for PR 18567 at commit 
[`802163d`](https://github.com/apache/spark/commit/802163ddecafe329c9d8b2161fac120bb26dd4b1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273150
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.sql.execution.command
 
-import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
-import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, 
CatalogTableType}
-import org.apache.spark.sql.execution.SQLExecution
+import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, 
UnresolvedAttribute}
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, 
CatalogTable, CatalogTableType}
+import 
org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec
+import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, 
Expression, Literal}
 
 
 /**
- * Analyzes the given table to generate statistics, which will be used in 
query optimizations.
+ * Analyzes the given table to generate statistics, which will be used in
+ * query optimizations.
--- End diff --

unnecessary change





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273394
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
--- End diff --

We can merge this method with `assertStats` by adding `rowCount: Option[BigInt]` as a parameter.
Also please rename `assertStats` to `assertPartitionStats`.
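
A minimal, self-contained sketch of the merged helper (here `CatalogStatistics` is a simplified stand-in for the real class, and the `queryStats` lookup is passed in explicitly so the snippet runs on its own):

```scala
object MergedAssertSketch {
  final case class CatalogStatistics(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

  // Merged form of assertStats / assertSizeInBytesStats: rowCount is an Option, so the
  // NOSCAN case passes None and the full-scan case passes Some(n).
  def assertPartitionStats(
      queryStats: (String, String) => Option[CatalogStatistics])(
      ds: String, hr: String, rowCount: Option[BigInt], sizeInBytes: BigInt): Unit = {
    val stats = queryStats(ds, hr).get
    assert(stats.rowCount == rowCount, s"unexpected rowCount: ${stats.rowCount}")
    assert(stats.sizeInBytes == sizeInBytes, s"unexpected sizeInBytes: ${stats.sizeInBytes}")
  }

  def main(args: Array[String]): Unit = {
    val fakeStats = Map(("2010-01-01", "10") -> CatalogStatistics(BigInt(2000), Some(BigInt(500))))
    val lookup = (ds: String, hr: String) => fakeStats.get((ds, hr))
    assertPartitionStats(lookup)("2010-01-01", "10", rowCount = Some(BigInt(500)), sizeInBytes = BigInt(2000))
  }
}
```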





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273439
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === None)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-01", 11, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  assertStats("2010-01-01", "10", 500, 2000)
+  assertStats("2010-01-01", "11", 500, 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assertStats("2010-01-01", "10", 500, 2000)
+  assertStats("2010-01-01", "11", 500, 2000)
+ 

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273560
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
--- End diff --

we can just use `SELECT '1', 'A'` and remove this val.





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273102
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val partitionSpec =
--- End diff --

`partitionSpec` -> `partitionValueSpec`





[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273584
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === None)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-01", 11, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  assertStats("2010-01-01", "10", 500, 2000)
+  assertStats("2010-01-01", "11", 500, 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assertStats("2010-01-01", "10", 500, 2000)
+  assertStats("2010-01-01", "11", 500, 2000)
+ 

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273555
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
--- End diff --

Since it's a test for `analyze single partition`, we only need one partition here; the others seem redundant.
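
For reference, a minimal sketch of what the trimmed test body could look like, reusing the helpers defined in this diff (the 500-row / 2000-byte expectations come from the existing assertions):

```scala
withTable(tableName) {
  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)")
  createPartition("2010-01-01", SELECT_FROM_SRC)

  // NOSCAN collects only the size, not the row count.
  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS NOSCAN")
  assert(queryStats("2010-01-01").rowCount === None)
  assert(queryStats("2010-01-01").sizeInBytes === 2000)

  // A full scan additionally collects the row count.
  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE STATISTICS")
  assert(queryStats("2010-01-01").rowCount.get === 500)
  assert(queryStats("2010-01-01").sizeInBytes === 2000)
}
```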


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273031
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val partitionSpec =
+  if (ctx.partitionSpec != null) {
+val filteredSpec = 
visitPartitionSpec(ctx.partitionSpec).filter(_._2.isDefined)
+if (filteredSpec.isEmpty) {
+  None
+} else {
+  Some(filteredSpec.mapValues(_.get))
+}
+  } else {
+None
+  }
+
+val table = visitTableIdentifier(ctx.tableIdentifier)
+if (ctx.identifierSeq() == null) {
+  if (partitionSpec.isDefined) {
+AnalyzePartitionCommand(table, partitionSpec.get, noscan = 
ctx.identifier != null)
+  } else {
+AnalyzeTableCommand(table, noscan = ctx.identifier != null)
   }
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
-} else if (ctx.identifierSeq() == null) {
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier), 
noscan = false)
 } else {
+  if (partitionSpec.isDefined) {
+logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
--- End diff --

Partition specification is ignored when collecting column statistics
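
i.e., the warning could read something along the lines of:

```scala
logWarning(s"Partition specification is ignored when collecting column statistics: " +
  s"${ctx.partitionSpec.getText}")
```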


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273588
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === None)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-01", 11, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  assertStats("2010-01-01", "10", 500, 2000)
--- End diff --

`assertStats("2010-01-01", "10", rowCount = 500, sizeInBytes = 2000)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273094
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -93,30 +93,50 @@ class SparkSqlAstBuilder extends AstBuilder {
   }
 
   /**
-   * Create an [[AnalyzeTableCommand]] command or an 
[[AnalyzeColumnCommand]] command.
-   * Example SQL for analyzing table :
+   * Create an [[AnalyzeTableCommand]] command, or an 
[[AnalyzePartitionCommand]]
+   * or an [[AnalyzeColumnCommand]] command.
+   * Example SQL for analyzing table or a set of partitions :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS [NOSCAN];
+   *   ANALYZE TABLE [db_name.]tablename [PARTITION (partcol1[=val1], 
partcol2[=val2], ...)]
+   *   COMPUTE STATISTICS [NOSCAN];
* }}}
+   *
* Example SQL for analyzing columns :
* {{{
-   *   ANALYZE TABLE table COMPUTE STATISTICS FOR COLUMNS column1, column2;
+   *   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS 
column1, column2;
* }}}
*/
   override def visitAnalyze(ctx: AnalyzeContext): LogicalPlan = 
withOrigin(ctx) {
-if (ctx.partitionSpec != null) {
-  logWarning(s"Partition specification is ignored: 
${ctx.partitionSpec.getText}")
+if (ctx.identifier != null &&
+ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+  throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
 }
-if (ctx.identifier != null) {
-  if (ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
-throw new ParseException(s"Expected `NOSCAN` instead of 
`${ctx.identifier.getText}`", ctx)
+
+val partitionSpec =
+  if (ctx.partitionSpec != null) {
+val filteredSpec = 
visitPartitionSpec(ctx.partitionSpec).filter(_._2.isDefined)
+if (filteredSpec.isEmpty) {
+  None
+} else {
+  Some(filteredSpec.mapValues(_.get))
+}
+  } else {
+None
+  }
+
+val table = visitTableIdentifier(ctx.tableIdentifier)
+if (ctx.identifierSeq() == null) {
+  if (partitionSpec.isDefined) {
+AnalyzePartitionCommand(table, partitionSpec.get, noscan = 
ctx.identifier != null)
+  } else {
+AnalyzeTableCommand(table, noscan = ctx.identifier != null)
   }
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier))
-} else if (ctx.identifierSeq() == null) {
-  AnalyzeTableCommand(visitTableIdentifier(ctx.tableIdentifier), 
noscan = false)
 } else {
+  if (partitionSpec.isDefined) {
--- End diff --

We should use `if (ctx.partitionSpec != null)` here; currently we don't support partition-level column stats.
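
A sketch of how that branch could look (the last line mirrors the existing column-stats command construction):

```scala
} else {
  // Column stats are table-level only for now, so warn about (and ignore) any partition spec.
  if (ctx.partitionSpec != null) {
    logWarning(s"Partition specification is ignored when collecting column statistics: " +
      s"${ctx.partitionSpec.getText}")
  }
  AnalyzeColumnCommand(table, visitIdentifierSeq(ctx.identifierSeq()))
}
```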


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273582
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -181,6 +182,151 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
 }
   }
 
+  private val SELECT_FROM_SRC = "SELECT '1', 'A' from src"
+
+  test("analyze single partition") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String): CatalogStatistics = {
+  val partition =
+
spark.sessionState.catalog.getPartition(TableIdentifier(tableName), Map("ds" -> 
ds))
+  partition.stats.get
+}
+
+def createPartition(ds: String, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') $query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING)")
+
+  createPartition("2010-01-01", SELECT_FROM_SRC)
+  createPartition("2010-01-02", s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+  createPartition("2010-01-03", SELECT_FROM_SRC)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assert(queryStats("2010-01-01").rowCount === None)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount === None)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS")
+
+  assert(queryStats("2010-01-01").rowCount.get === 500)
+  assert(queryStats("2010-01-01").sizeInBytes === 2000)
+
+  assert(queryStats("2010-01-02").rowCount.get === 2*500)
+  assert(queryStats("2010-01-02").sizeInBytes === 2*2000)
+}
+  }
+
+  test("analyze a set of partitions") {
+val tableName = "analyzeTable_part"
+
+def queryStats(ds: String, hr: String): Option[CatalogStatistics] = {
+  val tableId = TableIdentifier(tableName)
+  val partition =
+spark.sessionState.catalog.getPartition(tableId, Map("ds" -> ds, 
"hr" -> hr))
+  partition.stats
+}
+
+def assertStats(ds: String, hr: String, rowCount: BigInt, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === Some(rowCount))
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def assertSizeInBytesStats(ds: String, hr: String, sizeInBytes: 
BigInt): Unit = {
+  val stats = queryStats(ds, hr).get
+  assert(stats.rowCount === None)
+  assert(stats.sizeInBytes === sizeInBytes)
+}
+
+def createPartition(ds: String, hr: Int, query: String): Unit = {
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds', hr=$hr) 
$query")
+}
+
+withTable(tableName) {
+  sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED 
BY (ds STRING, hr INT)")
+
+  createPartition("2010-01-01", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-01", 11, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 10, SELECT_FROM_SRC)
+  createPartition("2010-01-02", 11, s"$SELECT_FROM_SRC UNION ALL 
$SELECT_FROM_SRC")
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-01') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assert(queryStats("2010-01-02", "10") === None)
+  assert(queryStats("2010-01-02", "11") === None)
+
+  sql(s"ANALYZE TABLE $tableName PARTITION (ds='2010-01-02') COMPUTE 
STATISTICS NOSCAN")
+
+  assertSizeInBytesStats("2010-01-01", "10", 2000)
+  assertSizeInBytesStats("2010-01-01", "11", 2000)
+  assertSizeInBytesStats("2010-01-02", "10", 2000)
+  assertSizeInBytesStats("2010-01-02", "11", 2*2000)
--- End diff --

The checks for `ds='2010-01-02'` are no different from those for `ds='2010-01-01'`; we can remove them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18421: [SPARK-21213][SQL] Support collecting partition-l...

2017-07-07 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18421#discussion_r126273260
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -72,3 +61,81 @@ case class AnalyzeTableCommand(
 Seq.empty[Row]
   }
 }
+
+/**
+ * Analyzes a given set of partitions to generate per-partition 
statistics, which will be used in
+ * query optimizations.
+ */
+case class AnalyzePartitionCommand(
--- End diff --

shall we create a new file for this class?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18300: [SPARK-21043][SQL] Add unionByName in Dataset

2017-07-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/18300#discussion_r126273573
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -112,6 +112,91 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 )
   }
 
+  test("union by name") {
+// Simple case
+var df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
+var df2 = Seq((3, 1, 2)).toDF("c", "a", "b")
+var df3 = Seq((2, 3, 1)).toDF("b", "c", "a")
+val unionDf = df1.unionByName(df2.unionByName(df3))
+checkAnswer(unionDf,
+  Row(1, 2, 3) :: Row(1, 2, 3) :: Row(1, 2, 3) :: Nil
+)
+
+// Check if adjacent unions are combined into a single one
+assert(unionDf.queryExecution.optimizedPlan.collect { case u: Union => 
true }.size == 1)
+
+// Check some type coercion
+df1 = Seq((1, "a")).toDF("c0", "c1")
+df2 = Seq((3, 1L)).toDF("c1", "c0")
+checkAnswer(df1.unionByName(df2), Row(1L, "a") :: Row(1L, "3") :: Nil)
+
+df1 = Seq((1, 1.0)).toDF("c0", "c1")
+df2 = Seq((8L, 3.0)).toDF("c1", "c0")
+checkAnswer(df1.unionByName(df2), Row(1.0, 1.0) :: Row(3.0, 8.0) :: 
Nil)
+
+df1 = Seq((2.0f, 7.4)).toDF("c0", "c1")
+df2 = Seq(("a", 4.0)).toDF("c1", "c0")
+checkAnswer(df1.unionByName(df2), Row(2.0, "7.4") :: Row(4.0, "a") :: 
Nil)
+
+df1 = Seq((1, "a", 3.0)).toDF("c0", "c1", "c2")
+df2 = Seq((1.2, 2, "bc")).toDF("c2", "c0", "c1")
+df3 = Seq(("def", 1.2, 3)).toDF("c1", "c2", "c0")
+checkAnswer(df1.unionByName(df2.unionByName(df3)),
+  Row(1, "a", 3.0) :: Row(2, "bc", 1.2) :: Row(3, "def", 1.2) :: Nil
+)
+
+// Failure cases
--- End diff --

ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273432
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -51,29 +63,66 @@ package object config {
 
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val EXECUTOR_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.executor.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.executor.userClassPathFirst")
+  .doc("Same functionality as spark.driver.userClassPathFirst, " +
+"but applied to executor instances.")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val EXECUTOR_MEMORY = 
ConfigBuilder("spark.executor.memory")
+.doc("Amount of memory to use per executor process (e.g. 2g, 8g).")
 .bytesConf(ByteUnit.MiB)
+.checkValue(v => v > 0 && v <= Int.MaxValue / (1024 * 1024),
+  "The executor memory must be greater than 0M " +
+  s"and less than ${Int.MaxValue / (1024 * 1024)}M.")
 .createWithDefaultString("1g")
 
-  private[spark] val IS_PYTHON_APP = 
ConfigBuilder("spark.yarn.isPython").internal()
-.booleanConf.createWithDefault(false)
+  private[spark] val IS_PYTHON_APP = ConfigBuilder("spark.yarn.isPython")
+.internal()
+.booleanConf
+.checkValues(Set(false, true))
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273429
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -51,29 +63,66 @@ package object config {
 
ConfigBuilder(SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val EXECUTOR_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.executor.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.executor.userClassPathFirst")
+  .doc("Same functionality as spark.driver.userClassPathFirst, " +
+"but applied to executor instances.")
+  .booleanConf
+  .checkValues(Set(false, true))
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273417
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -35,10 +35,22 @@ package object config {
 
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.driver.userClassPathFirst")
+  .doc("Whether to give user-added jars precedence over Spark's own 
jars " +
+"when loading classes in the driver. This feature can be used to 
mitigate " +
+"conflicts between Spark's dependencies and user dependencies. " +
+"It is currently an experimental feature. This is used in cluster 
mode only. ")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
+.doc("Amount of memory to use for the driver process, i.e. " +
+  "where SparkContext is initialized. (e.g. 1g, 2g). ")
--- End diff --

nvm, we should not mention that, and should always ask users to specify a unit.
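
A sketch of what the doc could look like then (assuming the builder keeps the existing `bytesConf(ByteUnit.MiB)` / `1g` default, as in the executor-memory entry):

```scala
private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
  .doc("Amount of memory to use for the driver process, i.e. where SparkContext " +
    "is initialized. Should be specified with a unit suffix (e.g. 1g, 2g).")
  .bytesConf(ByteUnit.MiB)
  .createWithDefaultString("1g")
```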


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18459
  
**[Test build #79363 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79363/testReport)**
 for PR 18459 at commit 
[`26dfc82`](https://github.com/apache/spark/commit/26dfc82896c3d2a5178b34e46162b3cc06ea8f2a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-07-07 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/18459
  
test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18555
  
**[Test build #79361 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79361/testReport)**
 for PR 18555 at commit 
[`b19b378`](https://github.com/apache/spark/commit/b19b378ed72bfa910e889f23bc4f848324445d08).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273242
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -35,10 +35,22 @@ package object config {
 
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.driver.userClassPathFirst")
+  .doc("Whether to give user-added jars precedence over Spark's own 
jars " +
+"when loading classes in the driver. This feature can be used to 
mitigate " +
+"conflicts between Spark's dependencies and user dependencies. " +
+"It is currently an experimental feature. This is used in cluster 
mode only. ")
+  .booleanConf
+  .checkValues(Set(false, true))
+  .createWithDefault(false)
 
   private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
+.doc("Amount of memory to use for the driver process, i.e. " +
+  "where SparkContext is initialized. (e.g. 1g, 2g). ")
--- End diff --

we should also mention that, if no unit is given, e.g. `500`, it means 
500mb. We also need to update `configuration.md` for this conf.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18394: [SPARK-20379][core] Allow SSL config to reference env va...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18394
  
**[Test build #79362 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79362/testReport)**
 for PR 18394 at commit 
[`dc94c4b`](https://github.com/apache/spark/commit/dc94c4b9915bb9bef24a0200f5e2475221bbf656).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18555: [Minor]add checkValue in spark.internal.config ab...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18555#discussion_r126273196
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -35,10 +35,22 @@ package object config {
 
ConfigBuilder(SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH).stringConf.createOptional
 
   private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
-
ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.createWithDefault(false)
+ConfigBuilder("spark.driver.userClassPathFirst")
+  .doc("Whether to give user-added jars precedence over Spark's own 
jars " +
+"when loading classes in the driver. This feature can be used to 
mitigate " +
+"conflicts between Spark's dependencies and user dependencies. " +
+"It is currently an experimental feature. This is used in cluster 
mode only. ")
+  .booleanConf
+  .checkValues(Set(false, true))
--- End diff --

boolean conf doesn't need `checkValue`...
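
i.e. the builder can simply drop the `checkValues` line (the same applies to the other boolean entries flagged with "ditto"):

```scala
private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
  ConfigBuilder("spark.driver.userClassPathFirst")
    .doc("Whether to give user-added jars precedence over Spark's own jars " +
      "when loading classes in the driver. This feature can be used to mitigate " +
      "conflicts between Spark's dependencies and user dependencies. " +
      "It is currently an experimental feature. This is used in cluster mode only.")
    .booleanConf
    .createWithDefault(false)
```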


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18394: [SPARK-20379][core] Allow SSL config to reference env va...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18394
  
LGTM, pending tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18394: [SPARK-20379][core] Allow SSL config to reference env va...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18394
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18555: [Minor]add checkValue in spark.internal.config about how...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18555
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18394: [SPARK-20379][core] Allow SSL config to reference...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18394#discussion_r126273151
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -373,6 +373,11 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging with Seria
 Option(settings.get(key)).orElse(getDeprecatedConfig(key, this))
   }
 
+  /** Get an optional value, applying variable substitution. */
+  private[spark] def getWithSubstitution(key: String): Option[String] = {
--- End diff --

makes sense


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18393: [SPARK-20342][core] Update task accumulators befo...

2017-07-07 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18393#discussion_r126273069
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2277,6 +2277,29 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with Timeou
   (Success, 1)))
   }
 
+  test("task end event should have updated accumulators (SPARK-20342)") {
+val accumIds = new HashSet[Long]()
+val listener = new SparkListener() {
+  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
+event.taskInfo.accumulables.foreach { acc => accumIds += acc.id }
+  }
+}
+sc.addSparkListener(listener)
+
+// Try a few times in a loop to make sure. This is not guaranteed to 
fail when the bug exists,
+// but it should at least make the test flaky. If the bug is fixed, 
this should always pass.
+(1 to 10).foreach { _ =>
+  accumIds.clear()
+
+  val accum = sc.longAccumulator
+  sc.parallelize(1 to 10, 10).foreach { _ =>
--- End diff --

The bug is that a task may lose its accumulator updates, but this test can only fail if all 10 of these tasks lose their accumulator updates. Shall we use fewer partitions to make this test easier to fail?
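
For instance, a minimal sketch (the closure body is elided in the diff above, so the `accum.add` call here is only illustrative):

```scala
// With only 2 partitions there are just 2 tasks, so a single task dropping its
// accumulator updates is already enough to make the listener-based check fail.
val accum = sc.longAccumulator
sc.parallelize(1 to 10, 2).foreach { i => accum.add(i) }
```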


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18459
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18459
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79355/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18459
  
**[Test build #79355 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79355/testReport)**
 for PR 18459 at commit 
[`26dfc82`](https://github.com/apache/spark/commit/26dfc82896c3d2a5178b34e46162b3cc06ea8f2a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79359/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18468
  
**[Test build #79359 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79359/testReport)**
 for PR 18468 at commit 
[`f657fa8`](https://github.com/apache/spark/commit/f657fa8556b179d3586031a431f8065073927158).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18468: [SPARK-20873][SQL] Enhance ColumnVector to support compr...

2017-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18468
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18292: [SPARK-21083][SQL] Do not remove correct stats when re-a...

2017-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18292
  
**[Test build #79360 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79360/testReport)**
 for PR 18292 at commit 
[`b41adec`](https://github.com/apache/spark/commit/b41adeca65d05bb8de192969bcd0f28a0b4ee750).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


