[GitHub] spark issue #15729: [SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pi...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15729
  
**[Test build #67958 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67958/consoleFull)**
 for PR 15729 at commit 
[`dadfaa9`](https://github.com/apache/spark/commit/dadfaa95bb72f867b0c7e06909e68cea53be5a93).





[GitHub] spark issue #15697: [SparkR][Test]:remove unnecessary suppressWarnings

2016-11-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15697
  
For the specific failure here, it was due to the latest version not being kept 
in sync between the Windows release and the releases for other platforms (when 
R 3.3.2 came out). As we now use a fixed version, R 3.3.1 (not the latest, but 
an older one), we wouldn't hit the same case for the same reason in the future 
(I described this problem in more detail in https://github.com/apache/spark/pull/15709). 

However, I guess you meant the concern about re-triggering spurious 
failures in general. For re-triggering tests, I am still looking into this 
more deeply, and will file an INFRA JIRA as soon as I figure out a way 
committers can re-trigger. The (ugly) workarounds I have found so far are:
- Use another AppVeyor account
- Close and then re-open the PR to launch another build.







[GitHub] spark issue #15677: [SPARK-17963][SQL][Documentation] Add examples (extend) ...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15677
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67946/
Test PASSed.





[GitHub] spark issue #15677: [SPARK-17963][SQL][Documentation] Add examples (extend) ...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15677
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15729: [SPARK-18133] [branch-2.0] [Examples] [ML] [Pytho...

2016-11-01 Thread jagadeesanas2
GitHub user jagadeesanas2 opened a pull request:

https://github.com/apache/spark/pull/15729

[SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pipeline Exampl…

## What changes were proposed in this pull request?

[Fix] [branch-2.0] In Python 3 there is only one integer type (i.e., int), 
which mostly behaves like the long type in Python 2. Since Python 3 does not 
accept the "L" suffix, it has been removed from all examples.

## How was this patch tested?

Unit tests.

…e has syntax errors]

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ibmsoe/spark SPARK-18133_branch2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15729


commit dadfaa95bb72f867b0c7e06909e68cea53be5a93
Author: Jagadeesan 
Date:   2016-11-02T05:47:49Z

[SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pipeline Example has 
syntax errors]







[GitHub] spark issue #15677: [SPARK-17963][SQL][Documentation] Add examples (extend) ...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15677
  
**[Test build #67946 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67946/consoleFull)**
 for PR 15677 at commit 
[`a6b50eb`](https://github.com/apache/spark/commit/a6b50ebafb01edceca1fc8a729177cdb87da5e20).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15567: [SPARK-14393][SQL] values generated by non-deterministic...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15567
  
**[Test build #67957 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67957/consoleFull)**
 for PR 15567 at commit 
[`553c6a5`](https://github.com/apache/spark/commit/553c6a543dd18a7278bf989e9197e74dc3cece7c).





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15314
  
**[Test build #67956 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67956/consoleFull)**
 for PR 15314 at commit 
[`77bc5ab`](https://github.com/apache/spark/commit/77bc5ab725ce36b5a9bac6b9cf33ce2ee01bd131).





[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15725
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15700: [SPARK-17964][SparkR] Enable SparkR with Mesos client mo...

2016-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15700
  
This might not handle packages the best it can on Mesos, but that could be a 
follow-up.
@sun-rui 





[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15725
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67950/
Test PASSed.





[GitHub] spark issue #15705: [SPARK-18183] [SPARK-18184] Fix INSERT [INTO|OVERWRITE] ...

2016-11-01 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/15705
  
I think you don't have to, since this is just the test suite.

On Tue, Nov 1, 2016, 8:49 PM Wenchen Fan  wrote:

> *@cloud-fan* commented on this pull request.
> --
>
> In
> 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala
> :
>
> > @@ -180,7 +180,16 @@ class PlanParserSuite extends PlanTest {
>  partition: Map[String, Option[String]],
>  overwrite: Boolean = false,
>  ifNotExists: Boolean = false): LogicalPlan =
> -  InsertIntoTable(table("s"), partition, plan, overwrite, 
ifNotExists)
> +  InsertIntoTable(
> +table("s"), partition, plan,
> +OverwriteOptions(
> +  overwrite,
> +  if (overwrite && partition.nonEmpty) {
> +Some(partition.map(kv => (kv._1, kv._2.get)))
>
> do we need to consider dynamic partition here?
>






[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15725
  
**[Test build #67950 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67950/consoleFull)**
 for PR 15725 at commit 
[`0dca3ac`](https://github.com/apache/spark/commit/0dca3ac4dec6efbc3a52a6c995bc43296358a449).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r86076296
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -226,6 +227,63 @@ class HiveDDLSuite
 }
   }
 
+  test("SPARK-17732: Drop partitions by filter") {
+withTable("sales") {
+  sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
quarter STRING)")
+
+  for (country <- Seq("US", "CA", "KR")) {
+for (quarter <- 1 to 4) {
+  sql(s"ALTER TABLE sales ADD PARTITION (country='$country', 
quarter='$quarter')")
+}
+  }
+
+  sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
+  checkAnswer(sql("SHOW PARTITIONS sales"),
+Row("country=KR/quarter=1") ::
+Row("country=KR/quarter=2") ::
+Row("country=KR/quarter=3") ::
+Row("country=KR/quarter=4") ::
+Row("country=US/quarter=1") ::
+Row("country=US/quarter=2") ::
+Row("country=US/quarter=3") ::
+Row("country=US/quarter=4") :: Nil)
+
+  sql("ALTER TABLE sales DROP PARTITION (quarter <= '2')")
+  checkAnswer(sql("SHOW PARTITIONS sales"),
+Row("country=KR/quarter=3") ::
+Row("country=KR/quarter=4") ::
+Row("country=US/quarter=3") ::
+Row("country=US/quarter=4") :: Nil)
+
+  sql("ALTER TABLE sales DROP PARTITION (country='KR', quarter='4')")
+  sql("ALTER TABLE sales DROP PARTITION (country='US', quarter='3')")
--- End diff --

Let's add a test for dropping multiple partition specs?
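
A minimal sketch of what such a test could add, continuing the table state 
above (after the two single-spec drops, only `country=KR/quarter=3` and 
`country=US/quarter=4` remain; the multi-spec syntax follows Hive's 
`ALTER TABLE ... DROP PARTITION`, and the exact assertion is up to the PR author):

```scala
// Hypothetical extension of the test above: drop several partition specs
// in a single statement, then verify that no partitions are left.
sql("ALTER TABLE sales DROP PARTITION (country='KR', quarter='3'), (country='US', quarter='4')")
checkAnswer(sql("SHOW PARTITIONS sales"), Nil)
```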





[GitHub] spark pull request #15728: [SPARK-18133] [branch-2.0] [Examples] [ML] [Pytho...

2016-11-01 Thread jagadeesanas2
Github user jagadeesanas2 closed the pull request at:

https://github.com/apache/spark/pull/15728





[GitHub] spark issue #15728: [SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pi...

2016-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15728
  
could you close this and create a PR against branch-2.0 instead of master?





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r86075964
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -418,27 +419,53 @@ case class AlterTableRenamePartitionCommand(
  */
 case class AlterTableDropPartitionCommand(
 tableName: TableIdentifier,
-specs: Seq[TablePartitionSpec],
+specs: Seq[Expression],
 ifExists: Boolean,
 purge: Boolean)
-  extends RunnableCommand {
+  extends RunnableCommand with PredicateHelper {
+
+  private def hasComplexExpr(expr: Expression): Boolean = {
+expr.find(e => e.isInstanceOf[BinaryComparison] && 
!e.isInstanceOf[EqualTo]).isDefined
+  }
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
 val catalog = sparkSession.sessionState.catalog
 val table = catalog.getTableMetadata(tableName)
+val resolver = sparkSession.sessionState.conf.resolver
 DDLUtils.verifyAlterTableType(catalog, table, isView = false)
 DDLUtils.verifyPartitionProviderIsHive(sparkSession, table, "ALTER 
TABLE DROP PARTITION")
 
-val normalizedSpecs = specs.map { spec =>
-  PartitioningUtils.normalizePartitionSpec(
-spec,
-table.partitionColumnNames,
-table.identifier.quotedString,
-sparkSession.sessionState.conf.resolver)
+specs.flatMap(splitConjunctivePredicates).map {
+  case BinaryComparison(AttributeReference(key, _, _, _), _) =>
+table.partitionColumnNames.find(resolver(_, key)).getOrElse {
+  throw new AnalysisException(
+s"$key is not a valid partition column in table 
${table.identifier.quotedString}.")
+}
 }
 
-catalog.dropPartitions(
-  table.identifier, normalizedSpecs, ignoreIfNotExists = ifExists, 
purge = purge)
+if (specs.exists(hasComplexExpr)) {
+  val partitions = catalog.listPartitionsByFilter(table.identifier, 
specs)
+  if (partitions.nonEmpty) {
+catalog.dropPartitions(
+  table.identifier, partitions.map(_.spec), ignoreIfNotExists = 
ifExists, purge = purge)
+  } else if (!ifExists) {
+throw new AnalysisException(specs.toString)
--- End diff --

This might not be clear enough. Add a short error message?
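
A possible wording, as a sketch (the exact message is the PR author's call):

```scala
// Hypothetical replacement for the bare specs.toString above.
throw new AnalysisException(
  s"No partition matches the given filter(s): ${specs.mkString(", ")}")
```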





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15314
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r86075872
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -418,27 +419,53 @@ case class AlterTableRenamePartitionCommand(
  */
 case class AlterTableDropPartitionCommand(
 tableName: TableIdentifier,
-specs: Seq[TablePartitionSpec],
+specs: Seq[Expression],
 ifExists: Boolean,
 purge: Boolean)
-  extends RunnableCommand {
+  extends RunnableCommand with PredicateHelper {
+
+  private def hasComplexExpr(expr: Expression): Boolean = {
+expr.find(e => e.isInstanceOf[BinaryComparison] && 
!e.isInstanceOf[EqualTo]).isDefined
+  }
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
 val catalog = sparkSession.sessionState.catalog
 val table = catalog.getTableMetadata(tableName)
+val resolver = sparkSession.sessionState.conf.resolver
 DDLUtils.verifyAlterTableType(catalog, table, isView = false)
 DDLUtils.verifyPartitionProviderIsHive(sparkSession, table, "ALTER 
TABLE DROP PARTITION")
 
-val normalizedSpecs = specs.map { spec =>
-  PartitioningUtils.normalizePartitionSpec(
-spec,
-table.partitionColumnNames,
-table.identifier.quotedString,
-sparkSession.sessionState.conf.resolver)
+specs.flatMap(splitConjunctivePredicates).map {
--- End diff --

I think we don't need to do `splitConjunctivePredicates`. Just iterating over 
each attribute in every spec expression's `references` and doing the following 
resolving check should be enough.
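
A minimal sketch of that alternative, assuming the `specs`, `table`, and 
`resolver` values from the diff above:

```scala
// Hypothetical: validate every attribute referenced by each spec expression
// against the table's partition columns, without splitting conjuncts first.
specs.foreach { spec =>
  spec.references.foreach { attr =>
    if (!table.partitionColumnNames.exists(resolver(_, attr.name))) {
      throw new AnalysisException(
        s"${attr.name} is not a valid partition column in table " +
          s"${table.identifier.quotedString}.")
    }
  }
}
```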





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15314
  
**[Test build #67955 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67955/consoleFull)**
 for PR 15314 at commit 
[`88fea8e`](https://github.com/apache/spark/commit/88fea8e751c89273a6e6de7ddd89548cc27b5c5b).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15314
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67955/
Test FAILed.





[GitHub] spark issue #15314: [SPARK-17747][ML] WeightCol support non-double datatypes

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15314
  
**[Test build #67955 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67955/consoleFull)**
 for PR 15314 at commit 
[`88fea8e`](https://github.com/apache/spark/commit/88fea8e751c89273a6e6de7ddd89548cc27b5c5b).





[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...

2016-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15484#discussion_r86075526
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -229,10 +235,14 @@ class Analyzer(
  *  Group Count: 2 ^ N (N is the number of group expressions)
  *
  *  We need to get all of its subsets for a given GROUPBY expression, 
the subsets are
- *  represented as the bit masks.
+ *  represented as sequence of expressions.
  */
-def bitmasks(c: Cube): Seq[Int] = {
-  Seq.tabulate(1 << c.groupByExprs.length)(i => i)
+def cubeExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = 
exprs.toList match {
--- End diff --

Also I think you can just use subsets? e.g.

```
scala> Seq(1, 2, 3).toSet.subsets.foreach(println)
Set()
Set(1)
Set(2)
Set(3)
Set(1, 2)
Set(1, 3)
Set(2, 3)
Set(1, 2, 3)
```






[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...

2016-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15484#discussion_r86075344
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -229,10 +235,14 @@ class Analyzer(
  *  Group Count: 2 ^ N (N is the number of group expressions)
  *
  *  We need to get all of its subsets for a given GROUPBY expression, 
the subsets are
- *  represented as the bit masks.
+ *  represented as sequence of expressions.
  */
-def bitmasks(c: Cube): Seq[Int] = {
-  Seq.tabulate(1 << c.groupByExprs.length)(i => i)
+def cubeExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = 
exprs.toList match {
--- End diff --

I'd also write unit tests specifically for cubeExprs and rollupExprs





[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...

2016-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15484#discussion_r86075252
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -216,10 +216,16 @@ class Analyzer(
  *  Group Count: N + 1 (N is the number of group expressions)
  *
  *  We need to get all of its subsets for the rule described above, 
the subset is
- *  represented as the bit masks.
+ *  represented as sequence of expressions.
  */
-def bitmasks(r: Rollup): Seq[Int] = {
-  Seq.tabulate(r.groupByExprs.length + 1)(idx => (1 << idx) - 1)
+def rollupExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = {
+  val buffer = ArrayBuffer.empty[Seq[Expression]]
--- End diff --

Is this just `exprs.inits` ?

to be honest this is the first time I've seen the use of init/inits on a 
trait. 
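
For reference, `inits` yields every prefix of the sequence, from the full 
list down to the empty one, which is exactly the N + 1 grouping sets rollup needs:

```
scala> List(1, 2, 3).inits.toList
res0: List[List[Int]] = List(List(1, 2, 3), List(1, 2), List(1), List())
```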





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86075163
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/PathOptionSuite.scala ---
@@ -0,0 +1,97 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.sources
+
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class TestOptionsSource extends RelationProvider with 
CreatableRelationProvider {
+
+  override def createRelation(
+  sqlContext: SQLContext,
+  parameters: Map[String, String]): BaseRelation = {
+new TestOptionsRelation(parameters)(sqlContext.sparkSession)
+  }
+
+  override def createRelation(
+  sqlContext: SQLContext,
+  mode: SaveMode,
+  parameters: Map[String, String],
+  data: DataFrame): BaseRelation = {
+new TestOptionsRelation(parameters)(sqlContext.sparkSession)
+  }
+}
+
+class TestOptionsRelation(val options: Map[String, String])(@transient val 
session: SparkSession)
+  extends BaseRelation {
+
+  override def sqlContext: SQLContext = session.sqlContext
+
+  override def schema: StructType = new StructType().add("i", "int")
+}
+
+class PathOptionSuite extends DataSourceTest with SharedSQLContext {
+
+  test("path option always exist") {
+withTable("src") {
+  sql(
+s"""
+   |CREATE TABLE src(i int)
+   |USING ${classOf[TestOptionsSource].getCanonicalName}
+   |OPTIONS (PATH '/tmp/path')""".stripMargin)
+  assert(getPathOption("src") == Some("/tmp/path"))
+}
+
+// should exist even if the path option is not specified when creating the table
+withTable("src") {
+  sql(s"CREATE TABLE src(i int) USING 
${classOf[TestOptionsSource].getCanonicalName}")
+  assert(getPathOption("src") == Some(defaultTablePath("src")))
+}
+  }
+
+  test("path option always represent the value of table location") {
+withTable("src") {
+  sql(
+s"""
+   |CREATE TABLE src(i int)
+   |USING ${classOf[TestOptionsSource].getCanonicalName}
+   |OPTIONS (PATH '/tmp/path')""".stripMargin)
+  sql("ALTER TABLE src SET LOCATION '/tmp/path2'")
+  assert(getPathOption("src") == Some("/tmp/path2"))
+}
+
+withTable("src", "src2") {
+  sql(s"CREATE TABLE src(i int) USING 
${classOf[TestOptionsSource].getCanonicalName}")
+  sql("ALTER TABLE src RENAME TO src2")
--- End diff --

This test case is still calling `InMemoryCatalog.renameTable`. Thus, we 
still need a test case to verify the behavior of 
`HiveExternalCatalog.renameTable`.
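
A minimal sketch of what that extra test might look like in a Hive-backed 
suite, assuming helpers analogous to this suite's `getPathOption` and 
`defaultTablePath` are available there (names and placement are hypothetical):

```scala
// Hypothetical: the same rename scenario, but run against a Hive-enabled
// session so it exercises HiveExternalCatalog.renameTable.
test("path option tracks table location after rename (Hive catalog)") {
  withTable("src", "src2") {
    sql(s"CREATE TABLE src(i int) USING ${classOf[TestOptionsSource].getCanonicalName}")
    sql("ALTER TABLE src RENAME TO src2")
    assert(getPathOption("src2") == Some(defaultTablePath("src2")))
  }
}
```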





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r86075107
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -418,27 +419,53 @@ case class AlterTableRenamePartitionCommand(
  */
 case class AlterTableDropPartitionCommand(
 tableName: TableIdentifier,
-specs: Seq[TablePartitionSpec],
+specs: Seq[Expression],
 ifExists: Boolean,
 purge: Boolean)
-  extends RunnableCommand {
+  extends RunnableCommand with PredicateHelper {
+
+  private def hasComplexExpr(expr: Expression): Boolean = {
--- End diff --

The function name looks confusing. Actually, they are not more complex 
operators, are they?
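
One possible clearer name, as a sketch (behavior unchanged; the predicate 
really asks whether the spec contains a non-equality comparison):

```scala
// Hypothetical rename of the helper above.
private def hasNonEqualityComparison(expr: Expression): Boolean =
  expr.find(e => e.isInstanceOf[BinaryComparison] && !e.isInstanceOf[EqualTo]).isDefined
```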





[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-01 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/15702#discussion_r86075063
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala
 ---
@@ -104,85 +110,105 @@ case class StateStoreSaveExec(
 
   override protected def doExecute(): RDD[InternalRow] = {
 metrics // force lazy init at driver
-assert(returnAllStates.nonEmpty,
-  "Incorrect planning in IncrementalExecution, returnAllStates have 
not been set")
-val saveAndReturnFunc = if (returnAllStates.get) saveAndReturnAll _ 
else saveAndReturnUpdated _
+assert(outputMode.nonEmpty,
+  "Incorrect planning in IncrementalExecution, outputMode has not been 
set")
+
 child.execute().mapPartitionsWithStateStore(
   getStateId.checkpointLocation,
   operatorId = getStateId.operatorId,
   storeVersion = getStateId.batchId,
   keyExpressions.toStructType,
   child.output.toStructType,
   sqlContext.sessionState,
-  Some(sqlContext.streams.stateStoreCoordinator)
-)(saveAndReturnFunc)
+  Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) =>
+val getKey = GenerateUnsafeProjection.generate(keyExpressions, 
child.output)
+val numOutputRows = longMetric("numOutputRows")
+val numTotalStateRows = longMetric("numTotalStateRows")
+val numUpdatedStateRows = longMetric("numUpdatedStateRows")
+
+outputMode match {
+  // Update and output all rows in the StateStore.
+  case Some(Complete) =>
+while (iter.hasNext) {
+  val row = iter.next().asInstanceOf[UnsafeRow]
+  val key = getKey(row)
+  store.put(key.copy(), row.copy())
+  numUpdatedStateRows += 1
+}
+store.commit()
+numTotalStateRows += store.numKeys()
+store.iterator().map { case (k, v) =>
+  numOutputRows += 1
+  v.asInstanceOf[InternalRow]
+}
+
+  // Update and output only rows being evicted from the StateStore
+  case Some(Append) =>
+while (iter.hasNext) {
+  val row = iter.next().asInstanceOf[UnsafeRow]
+  val key = getKey(row)
+  store.put(key.copy(), row.copy())
+  numUpdatedStateRows += 1
+}
+
+val watermarkAttribute =
+  
keyExpressions.find(_.metadata.contains(EventTimeWatermark.delayKey)).get
+// If we are evicting based on a window, use the end of the 
window.  Otherwise just
+// use the attribute itself.
+val evictionExpression =
+  if (watermarkAttribute.dataType.isInstanceOf[StructType]) {
+LessThanOrEqual(
+  GetStructField(watermarkAttribute, 1),
+  Literal(eventTimeWatermark.get * 1000))
+  } else {
+LessThanOrEqual(
+  watermarkAttribute,
+  Literal(eventTimeWatermark.get * 1000))
+  }
+
+logInfo(s"Filtering state store on: $evictionExpression")
+val predicate = newPredicate(evictionExpression, 
keyExpressions)
+store.remove(predicate)
+
+store.commit()
+
+numTotalStateRows += store.numKeys()
+store.updates().filter(_.isInstanceOf[ValueRemoved]).map { 
removed =>
+  numOutputRows += 1
+  removed.value.asInstanceOf[InternalRow]
+}
+
+  // Update and output modified rows from the StateStore.
+  case Some(Update) =>
--- End diff --

@koeninger, Update should allow late data to correct the previous results 
even when it arrives later than the threshold; a similar implementation is in 
http://cdn.oreillystatic.com/en/assets/1/event/160/Triggers%20in%20Apache%20Beam%20_incubating_%20Presentation.pdf
 (search 'elementCountAtLeast'). Correct me if I'm wrong.





[GitHub] spark pull request #15673: [SPARK-17992][SQL] Return all partitions from Hiv...

2016-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15673





[GitHub] spark issue #15697: [SparkR][Test]:remove unnecessary suppressWarnings

2016-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15697
  
@HyukjinKwon the time window during which the download of an older version 
for Windows could fail seems concerning. Is there a way to address that?






[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15673
  
Merging in master. Thanks.






[GitHub] spark pull request #15608: [SPARK-17838][SparkR] Check named arguments for o...

2016-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15608





[GitHub] spark pull request #15704: [SPARK-17732][SQL] ALTER TABLE DROP PARTITION sho...

2016-11-01 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15704#discussion_r86074588
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -204,6 +207,38 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] 
with Logging {
   }
 
   /**
+   * Create a partition filter specification.
+   */
+  def visitPartitionFilterSpec(ctx: PartitionSpecContext): Expression = 
withOrigin(ctx) {
+val parts = ctx.partitionVal.asScala.map { pVal =>
+  val name = pVal.identifier.getText
+  val operator = Option(pVal.comparisonOperator).map(_.getText)
+  if (operator.isDefined) {
+val left = AttributeReference(name, DataTypes.StringType)()
+val right = expression(pVal.constant)
+val operator = 
pVal.comparisonOperator().getChild(0).asInstanceOf[TerminalNode]
+operator.getSymbol.getType match {
+  case SqlBaseParser.EQ =>
+EqualTo(left, right)
+  case SqlBaseParser.NEQ | SqlBaseParser.NEQJ =>
+Not(EqualTo(left, right))
+  case SqlBaseParser.LT =>
+LessThan(left, right)
+  case SqlBaseParser.LTE =>
+LessThanOrEqual(left, right)
+  case SqlBaseParser.GT =>
+GreaterThan(left, right)
+  case SqlBaseParser.GTE =>
+GreaterThanOrEqual(left, right)
--- End diff --

Failed to match `SqlBaseParser.NSEQ` might cause runtime error.
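
A minimal sketch of a guard for that, assuming the `ParseException(message, ctx)` 
constructor used elsewhere in this parser (whether `<=>` should be supported 
or rejected here is the PR author's call):

```scala
// Hypothetical catch-all arm for the match above, so an unhandled operator
// fails with a parse error instead of a MatchError at runtime.
case _ =>
  throw new ParseException(
    s"Unsupported comparison operator in partition filter: ${operator.getText}", ctx)
```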





[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15673
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15608: [SPARK-17838][SparkR] Check named arguments for options ...

2016-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15608
  
merged to master.





[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15673
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67949/
Test PASSed.





[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15673
  
**[Test build #67949 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67949/consoleFull)**
 for PR 15673 at commit 
[`8d468ac`](https://github.com/apache/spark/commit/8d468ac7097de56989deee124ce65a6583f8eaa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15659
  
**[Test build #67954 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67954/consoleFull)**
 for PR 15659 at commit 
[`7af912a`](https://github.com/apache/spark/commit/7af912a30d6472684838b8ff424495d28d845682).





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073517
  
--- Diff: python/setup.py ---
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+exec(open('pyspark/version.py').read())
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
+print("Python versions prior to 2.7 are not supported.", 
file=sys.stderr)
+exit(-1)
+
+# Check and see if we are under the spark path in which case we need to 
build the symlink farm.
+# This is important because we only want to build the symlink farm while 
under Spark otherwise we
+# want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
+# partially built sdist) we should error and have the user sort it out.
+in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
--- End diff --

I mean we haven't moved SparkContext.scala, and it's the clearest thing 
that will only exist in Spark. I'm open to checking for something else if 
people have a suggestion.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073446
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+try:
+exec(open('pyspark/version.py').read())
+except IOError:
+print("Failed to load PySpark version file for packaging you must be 
in Spark's python dir.",
+  file=sys.stderr)
+sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and 
fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
--- End diff --

I can move it up, though not quite to the top (we need the sys import first, of course).





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073257
  
--- Diff: python/pyspark/version.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__version__ = '2.1.0.dev1'
--- End diff --

Well, we want to indicate it's the equivalent of a SNAPSHOT version, and 
following PEP 440 (which we need for eventual PyPI publishing) we swapped 
`SNAPSHOT` for `dev`.





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86073111
  
--- Diff: python/pyspark/version.py ---
@@ -0,0 +1,19 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__version__ = '2.1.0.dev1'
--- End diff --

does this need to go as "2.1.0"?





[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...

2016-11-01 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15659
  
This looks to be low-ish risk given it is mostly build & script changes.
It would be great to get more python developers to review and test this out.






[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86072893
  
--- Diff: dev/run-tests.py ---
@@ -583,6 +589,7 @@ def main():
 modules_with_python_tests = [m for m in test_modules if m.python_test_goals]
 if modules_with_python_tests:
 run_python_tests(modules_with_python_tests, opts.parallelism)
+run_python_packaging_tests()
--- End diff --

I would +1 on this, given the logic in setup.py that should be checked.
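
As a rough sketch of the property worth checking there (the helper below is 
hypothetical, not the PR's actual `run_python_packaging_tests`):

```python
# Hypothetical packaging smoke test: build an sdist, install it into a clean
# virtualenv, and confirm that pyspark imports. Python 3 only (uses venv).
import glob
import os
import subprocess
import tempfile
import venv

def smoke_test_packaging(python_dir):
    subprocess.check_call(["python", "setup.py", "sdist"], cwd=python_dir)
    sdist = glob.glob(os.path.join(python_dir, "dist", "pyspark-*.tar.gz"))[0]
    with tempfile.TemporaryDirectory() as env_dir:
        venv.create(env_dir, with_pip=True)
        bin_dir = os.path.join(env_dir, "bin")
        subprocess.check_call([os.path.join(bin_dir, "pip"), "install", sdist])
        subprocess.check_call([os.path.join(bin_dir, "python"), "-c", "import pyspark"])
```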





[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...

2016-11-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15659#discussion_r86072643
  
--- Diff: python/setup.py ---
@@ -0,0 +1,180 @@
+#!/usr/bin/env python
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import glob
+import os
+import sys
+from setuptools import setup, find_packages
+from shutil import copyfile, copytree, rmtree
+
+try:
+    exec(open('pyspark/version.py').read())
+except IOError:
+    print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.",
+          file=sys.stderr)
+    sys.exit(-1)
+VERSION = __version__
+# A temporary path so we can access above the Python project root and fetch scripts and jars we need
+TEMP_PATH = "deps"
+SPARK_HOME = os.path.abspath("../")
+JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME
+
+# Use the release jars path if we are in release mode.
+if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1):
+    JARS_PATH = "%s/jars/" % SPARK_HOME
+
+EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME
+SCRIPTS_PATH = "%s/bin" % SPARK_HOME
+SCRIPTS_TARGET = "%s/bin" % TEMP_PATH
+JARS_TARGET = "%s/jars" % TEMP_PATH
+EXAMPLES_TARGET = "%s/examples" % TEMP_PATH
+
+if sys.version_info < (2, 7):
--- End diff --

nit: move this to the top of the file?
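
Something like this at the very top of setup.py, as an illustrative sketch 
rather than the patch's actual code, would fail fast before setuptools is 
even imported:

```python
#!/usr/bin/env python
from __future__ import print_function
import sys

# Bail out before doing any packaging work on an unsupported interpreter.
if sys.version_info < (2, 7):
    print("Python versions prior to 2.7 are not supported.", file=sys.stderr)
    sys.exit(-1)
```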





[GitHub] spark issue #15726: [SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15726
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67945/
Test PASSed.





[GitHub] spark issue #15726: [SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15726
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15726: [SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15726
  
**[Test build #67945 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67945/consoleFull)**
 for PR 15726 at commit 
[`eae8f1a`](https://github.com/apache/spark/commit/eae8f1ad1d8240c236a73066610747f3e7ef3669).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15723: [SPARK-18214][SQL] Simplify RuntimeReplaceable type coer...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15723
  
**[Test build #67953 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67953/consoleFull)**
 for PR 15723 at commit 
[`a5c0b16`](https://github.com/apache/spark/commit/a5c0b16c4e4d11af0ffc7691ab40a58aeec29e78).





[GitHub] spark issue #15723: [SPARK-18214][SQL] Simplify RuntimeReplaceable type coer...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15723
  
**[Test build #3393 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3393/consoleFull)**
 for PR 15723 at commit 
[`a5c0b16`](https://github.com/apache/spark/commit/a5c0b16c4e4d11af0ffc7691ab40a58aeec29e78).





[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15671
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67947/
Test PASSed.





[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15671
  
**[Test build #67947 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67947/consoleFull)**
 for PR 15671 at commit 
[`6d2d13f`](https://github.com/apache/spark/commit/6d2d13f79ae68e71d023e7c79d19586842d49c75).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15671
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15724: [SPARK-18216][SQL] Make Column.expr public

2016-11-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15724





[GitHub] spark issue #15660: [SPARK-18133][Examples] [ML] [Python ML Pipeline Example...

2016-11-01 Thread jagadeesanas2
Github user jagadeesanas2 commented on the issue:

https://github.com/apache/spark/pull/15660
  
@yanboliang PR for branch-2.0 https://github.com/apache/spark/pull/15728





[GitHub] spark issue #15711: [SPARK-18192] Support all file formats in structured str...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15711
  
**[Test build #67952 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67952/consoleFull)**
 for PR 15711 at commit 
[`29071bb`](https://github.com/apache/spark/commit/29071bb77de5604926be69c61a75922ce9cd797e).





[GitHub] spark issue #15693: [SPARK-18125][SQL] Fix a compilation error in codegen du...

2016-11-01 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15693
  
@hvanhovell any more thoughts?





[GitHub] spark issue #15724: [SPARK-18216][SQL] Make Column.expr public

2016-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15724
  
Thanks - merging in master.





[GitHub] spark issue #15711: [SPARK-18192] Support all file formats in structured str...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15711
  
**[Test build #3392 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3392/consoleFull)**
 for PR 15711 at commit 
[`29071bb`](https://github.com/apache/spark/commit/29071bb77de5604926be69c61a75922ce9cd797e).





[GitHub] spark issue #15711: [SPARK-18192] Support all file formats in structured str...

2016-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15711
  
@marmbrus I ended up adding test cases since they were trivial to add (and 
an existing test case was asserting that text would not be supported).






[GitHub] spark issue #15728: [SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pi...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15728
  
**[Test build #67951 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67951/consoleFull)**
 for PR 15728 at commit 
[`6b7796f`](https://github.com/apache/spark/commit/6b7796f4d6bf41f31601403b6f3903cc16b46fb4).





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070024
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -91,7 +73,8 @@ case class CreateTableLikeCommand(
     CatalogTable(
       identifier = targetTable,
       tableType = CatalogTableType.MANAGED,
-      storage = newStorage,
+      // We are creating a new managed table, which should not have custom table location.
+      storage = sourceTableDesc.storage.copy(locationUri = None),
--- End diff --

When will we set the location? Is it set by hive metastore?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070455
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -207,6 +207,9 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
       if (tableDefinition.partitionProviderIsHive) {
         tableProperties.put(TABLE_PARTITION_PROVIDER, "hive")
       }
+      tableDefinition.storage.locationUri.foreach { location =>
+        tableProperties.put(TABLE_LOCATION, location)
--- End diff --

why do we need this? Why not just use `path` in serde properties?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86069832
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -85,14 +86,7 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
       }
     }
 
-    val optionsWithPath = if (table.tableType == CatalogTableType.MANAGED) {
-      table.storage.properties + ("path" -> sessionState.catalog.defaultTablePath(table.identifier))
--- End diff --

where do we assign the default location?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86069964
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -665,15 +665,7 @@ case class AlterTableSetLocationCommand(
         catalog.alterPartitions(tableName, Seq(newPart))
       case None =>
         // No partition spec is specified, so we set the location for the table itself
-        val newTable =
-          if (DDLUtils.isDatasourceTable(table)) {
-            table.withNewStorage(
-              locationUri = Some(location),
-              properties = table.storage.properties ++ Map("path" -> location))
-          } else {
-            table.withNewStorage(locationUri = Some(location))
-          }
-        catalog.alterTable(newTable)
+        catalog.alterTable(table.withNewStorage(locationUri = Some(location)))
--- End diff --

Do we still have this issue?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070624
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -383,8 +389,22 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   }
 
   override def renameTable(db: String, oldName: String, newName: String): Unit = withClient {
-    val newTable = client.getTable(db, oldName)
-      .copy(identifier = TableIdentifier(newName, Some(db)))
+    val rawTable = client.getTable(db, oldName)
+
+    val tableProps = if (rawTable.tableType == MANAGED) {
+      // If it's a managed table and we are renaming it, then the TABLE_LOCATION property becomes
+      // inaccurate as Hive metastore will generate a new table location in the `locationUri` field.
+      // Here we remove the TABLE_LOCATION property, so that we can read the value of `locationUri`
+      // field and treat it as table location when we read this table later.
+      rawTable.properties - TABLE_LOCATION
+    } else {
+      rawTable.properties
+    }
+
+    val newTable = rawTable.copy(
+      identifier = TableIdentifier(newName, Some(db)),
+      properties = tableProps)
+
--- End diff --

I am not sure if I am following at here. So, after rename, we will not have 
a table property representing the location?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070959
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -513,6 +555,16 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
     tableWithStats.copy(properties = getOriginalTableProperties(table))
   }
 
+  private def getLocationFromRawTable(rawTable: CatalogTable): Option[String] = {
+    rawTable.properties.get(TABLE_LOCATION).orElse {
+      // In older version of spark, we store the table location in storage properties with key
+      // `path`, instead of table properties with key `spark.sql.tableLocation`. We should
--- End diff --

Why do we need `spark.sql.tableLocation` instead of just relying on hive's 
location field and path in serde properties?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070329
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PathOptionSuite.scala ---
@@ -0,0 +1,97 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.sources
+
+import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class TestOptionsSource extends RelationProvider with CreatableRelationProvider {
+
+  override def createRelation(
+      sqlContext: SQLContext,
+      parameters: Map[String, String]): BaseRelation = {
+    new TestOptionsRelation(parameters)(sqlContext.sparkSession)
--- End diff --

Can we also add comment to explain which tests exercise this method?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070568
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -259,10 +266,9 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
         val location = if (tableDefinition.tableType == EXTERNAL) {
           // When we hit this branch, we are saving an external data source table with hive
           // compatible format, which means the data source is file-based and must have a `path`.
-          val map = new CaseInsensitiveMap(tableDefinition.storage.properties)
-          require(map.contains("path"),
+          require(tableDefinition.storage.locationUri.isDefined,
             "External file-based data source table must have a `path` entry in storage properties.")
-          Some(new Path(map("path")).toUri.toString)
+          Some(new Path(tableDefinition.storage.locationUri.get).toUri.toString)
--- End diff --

This part looks weird since we already have a location uri.





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86070169
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -541,3 +434,123 @@ case class DataSource(
     }
   }
 }
+
+object DataSource {
+
+  /** A map to maintain backward compatibility in case we move data sources around. */
+  private val backwardCompatibilityMap: Map[String, String] = {
+    val jdbc = classOf[JdbcRelationProvider].getCanonicalName
+    val json = classOf[JsonFileFormat].getCanonicalName
+    val parquet = classOf[ParquetFileFormat].getCanonicalName
+    val csv = classOf[CSVFileFormat].getCanonicalName
+    val libsvm = "org.apache.spark.ml.source.libsvm.LibSVMFileFormat"
+    val orc = "org.apache.spark.sql.hive.orc.OrcFileFormat"
+
+    Map(
+      "org.apache.spark.sql.jdbc" -> jdbc,
+      "org.apache.spark.sql.jdbc.DefaultSource" -> jdbc,
+      "org.apache.spark.sql.execution.datasources.jdbc.DefaultSource" -> jdbc,
+      "org.apache.spark.sql.execution.datasources.jdbc" -> jdbc,
+      "org.apache.spark.sql.json" -> json,
+      "org.apache.spark.sql.json.DefaultSource" -> json,
+      "org.apache.spark.sql.execution.datasources.json" -> json,
+      "org.apache.spark.sql.execution.datasources.json.DefaultSource" -> json,
+      "org.apache.spark.sql.parquet" -> parquet,
+      "org.apache.spark.sql.parquet.DefaultSource" -> parquet,
+      "org.apache.spark.sql.execution.datasources.parquet" -> parquet,
+      "org.apache.spark.sql.execution.datasources.parquet.DefaultSource" -> parquet,
+      "org.apache.spark.sql.hive.orc.DefaultSource" -> orc,
+      "org.apache.spark.sql.hive.orc" -> orc,
+      "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
+      "org.apache.spark.ml.source.libsvm" -> libsvm,
+      "com.databricks.spark.csv" -> csv
+    )
+  }
+
+  /**
+   * Class that were removed in Spark 2.0. Used to detect incompatibility libraries for Spark 2.0.
+   */
+  private val spark2RemovedClasses = Set(
+    "org.apache.spark.sql.DataFrame",
+    "org.apache.spark.sql.sources.HadoopFsRelationProvider",
+    "org.apache.spark.Logging")
+
+  /** Given a provider name, look up the data source class definition. */
+  def lookupDataSource(provider0: String): Class[_] = {
--- End diff --

why not just provider?





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-11-01 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r86066888
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -196,18 +196,32 @@ class InMemoryCatalog(
         throw new TableAlreadyExistsException(db = db, table = table)
       }
     } else {
-      if (tableDefinition.tableType == CatalogTableType.MANAGED) {
-        val dir = new Path(catalog(db).db.locationUri, table)
+      // Set the default table location if this is a managed table and its location is not
+      // specified.
+      // Ideally we should not create a managed table with location, but Hive serde table can
+      // specify location for managed table. And in [[CreateDataSourceTableAsSelectCommand]] we have
+      // to create the table directory and write out data before we create this table, to avoid
+      // exposing a partial written table.
+      val needDefaultTableLocation =
+      tableDefinition.tableType == CatalogTableType.MANAGED &&
--- End diff --

indentation





[GitHub] spark pull request #15728: [SPARK-18133] [branch-2.0] [Examples] [ML] [Pytho...

2016-11-01 Thread jagadeesanas2
GitHub user jagadeesanas2 opened a pull request:

https://github.com/apache/spark/pull/15728

[SPARK-18133] [branch-2.0] [Examples] [ML] [Python ML Pipeline Example has 
syntax errors]

## What changes were proposed in this pull request?

[Fix] [branch-2.0] In Python 3 there is only one integer type (i.e., int), 
which mostly behaves like the long type in Python 2. Since Python 3 won't 
accept the "L" suffix, it has been removed from all examples.

## How was this patch tested?

Unit tests.
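
For context, the syntax difference being fixed, as a minimal illustration 
rather than a line from the patch:

```python
# Python 2 allowed explicit long literals such as 11L; Python 3 rejects the
# "L" suffix with a SyntaxError because its single int type is already
# arbitrary precision. Plain literals work in both:
n = 11
assert n + 1 == 12
```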


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ibmsoe/spark SPARK-18133_2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15728.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15728


commit 191d99692dc4315c371b566e3a9c5b114876ee49
Author: Wenchen Fan 
Date:   2016-09-01T00:54:59Z

[SPARK-17180][SPARK-17309][SPARK-17323][SQL][2.0] create AlterViewAsCommand 
to handle ALTER VIEW AS

## What changes were proposed in this pull request?

Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 
3 bugs:

1. SPARK-17180: ALTER VIEW AS should alter temp view if view name has no 
database part and temp view exists
2. SPARK-17309: ALTER VIEW AS should issue exception if view does not exist.
3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, 
comment, create_time, etc.

The root cause is, ALTER VIEW AS is quite different from CREATE VIEW, we 
need different code path to handle them. However, in `CreateViewCommand`, there 
is no way to distinguish ALTER VIEW AS and CREATE VIEW, we have to introduce 
extra flag. But instead of doing this, I think a more natural way is to 
separate the ALTER VIEW AS logic into a new command.

backport https://github.com/apache/spark/pull/14874 to 2.0

## How was this patch tested?

new tests in SQLViewSuite

Author: Wenchen Fan 

Closes #14893 from cloud-fan/minor4.

commit 8711b451d727074173748418a47cec210f84f2f7
Author: Junyang Qian 
Date:   2016-09-01T04:28:53Z

[SPARKR][MINOR] Fix windowPartitionBy example

## What changes were proposed in this pull request?

The usage in the original example is incorrect. This PR fixes it.

## How was this patch tested?

Manual test.

Author: Junyang Qian 

Closes #14903 from junyangq/SPARKR-FixWindowPartitionByDoc.

(cherry picked from commit d008638fbedc857c1adc1dff399d427b8bae848e)
Signed-off-by: Shivaram Venkataraman 

commit 6281b74b6965ffcd0600844cea168cbe71ca8248
Author: Shixiong Zhu 
Date:   2016-09-01T06:25:20Z

[SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class 
defined in repl again

## What changes were proposed in this pull request?

After digging into the logs, I noticed the failure is because in this test, 
it starts a local cluster with 2 executors. However, when SparkContext is 
created, executors may be still not up. When one of the executor is not up 
during running the job, the blocks won't be replicated.

This PR just adds a wait loop before running the job to fix the flaky test.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #14905 from zsxwing/SPARK-17318-2.

(cherry picked from commit 21c0a4fe9d8e21819ba96e7dc2b1f2999d3299ae)
Signed-off-by: Shixiong Zhu 

commit 13bacd7308c42c92f42fbc3ffbee9a13282668a9
Author: Tejas Patil 
Date:   2016-09-01T16:49:43Z

[SPARK-17271][SQL] Planner adds un-necessary Sort even if child orde…

## What changes were proposed in this pull request?

Ports https://github.com/apache/spark/pull/14841 and 
https://github.com/apache/spark/pull/14910 from `master` to `branch-2.0`

Jira : https://issues.apache.org/jira/browse/SPARK-17271

Planner is adding un-needed SORT operation due to bug in the way comparison 
for `SortOrder` is done at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
`SortOrder` needs to be compared semantically because `Expression` within 
two `SortOrder` can be "semantically equal" but not literally equal objects.

eg. In case of `sql("SELECT * FROM table1 a JOIN table2 b ON 
a.col1=b.col1")`

Expression in required SortOrder:
```
  AttributeReference(
name = "col1",
dataType = LongType,
nullable = false
  ) (exprId = exprId,

[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15725
  
**[Test build #67950 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67950/consoleFull)**
 for PR 15725 at commit 
[`0dca3ac`](https://github.com/apache/spark/commit/0dca3ac4dec6efbc3a52a6c995bc43296358a449).





[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread JoshRosen
Github user JoshRosen commented on the issue:

https://github.com/apache/spark/pull/15725
  
Jenkins, retest this please





[GitHub] spark issue #15668: [SPARK-18137][SQL]Fix RewriteDistinctAggregates Unresolv...

2016-11-01 Thread windpiger
Github user windpiger commented on the issue:

https://github.com/apache/spark/pull/15668
  
@cloud-fan yes, it is simpler. I checked the query plan; my test case
`sql("SELECT percentile_approx(key, 0.9)" +
   ", count(distinct key), sum(distinct key) FROM src LIMIT 1")`
explains fine, as follows:
`+- *Expand [List(null, null, 0, cast(key#29 as double)), List(key#29, null, 1, null), List(null, cast(key#29 as bigint), 2, null)], [src.`key`#38, CAST(src.`key` AS BIGINT)#39L, gid#37, CAST(src.`key` AS DOUBLE)#40]
   +- HiveTableScan [key#29], MetastoreRelation default, src`

But I missed one case, where the distinct is on a constant, like:
`sql("SELECT percentile_approx(key, 0.9)" +
   ", count(distinct key), sum(distinct key), count(distinct 1) FROM src LIMIT 1")`
whose explain shows a literal in the Expand, as follows:
`+- *Expand [List(null, null, null, 0, cast(key#69 as double)), List(1, null, null, 1, null), List(null, key#69, null, 2, null), List(null, null, cast(key#69 as bigint), 3, null)], [src.`key`#80, CAST(src.`key` AS BIGINT)#81L, gid#79, CAST(src.`key` AS DOUBLE)#82]
   +- HiveTableScan [key#69], MetastoreRelation default, src`





[GitHub] spark issue #15692: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate'...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15692
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67948/
Test PASSed.





[GitHub] spark issue #15692: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate'...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15692
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15692: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate'...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15692
  
**[Test build #67948 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67948/consoleFull)**
 for PR 15692 at commit 
[`9e97eef`](https://github.com/apache/spark/commit/9e97eefcdf0950e81cc68a6a19f232f00042f473).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15727: [SPARK-17895] Improve doc for rangeBetween and rowsBetwe...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15727
  
Can one of the admins verify this patch?





[GitHub] spark pull request #15727: [SPARK-17895] Improve doc for rangeBetween and ro...

2016-11-01 Thread david-weiluo-ren
GitHub user david-weiluo-ren opened a pull request:

https://github.com/apache/spark/pull/15727

[SPARK-17895] Improve doc for rangeBetween and rowsBetween

## What changes were proposed in this pull request?

Copied description for row and range based frame boundary from 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56

Added examples to show different behavior of rangeBetween and rowsBetween 
when involving duplicate values.
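
A minimal PySpark illustration of that difference (the data and column names 
here are made up, not taken from the PR):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (1,), (2,)], ["v"])

rows_w = Window.orderBy("v").rowsBetween(Window.unboundedPreceding, Window.currentRow)
range_w = Window.orderBy("v").rangeBetween(Window.unboundedPreceding, Window.currentRow)

# rowsBetween frames physical rows, so the two v=1 rows get running sums 1 and 2;
# rangeBetween frames by value, so both v=1 rows are peers and share the sum 2.
df.select("v",
          F.sum("v").over(rows_w).alias("rows_sum"),
          F.sum("v").over(range_w).alias("range_sum")).show()
```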



Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
opening a pull request.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/david-weiluo-ren/spark 
improveDocForRangeAndRowsBetween

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15727


commit 2e74dba1a1c62f251ee1ef6411bd37bf20e94ca0
Author: buzhihuojie 
Date:   2016-11-02T03:50:55Z

improve doc for rangeBetween and rowsBetween







[GitHub] spark issue #15703: [SPARK-18186] Migrate HiveUDAFFunction to TypedImperativ...

2016-11-01 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15703
  
OK, now it's ready for review and merge.

cc @yhuai @JoshRosen @cloud-fan 





[GitHub] spark pull request #15705: [SPARK-18183] [SPARK-18184] Fix INSERT [INTO|OVER...

2016-11-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15705#discussion_r86069348
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala ---
@@ -180,7 +180,16 @@ class PlanParserSuite extends PlanTest {
         partition: Map[String, Option[String]],
         overwrite: Boolean = false,
         ifNotExists: Boolean = false): LogicalPlan =
-      InsertIntoTable(table("s"), partition, plan, overwrite, ifNotExists)
+      InsertIntoTable(
+        table("s"), partition, plan,
+        OverwriteOptions(
+          overwrite,
+          if (overwrite && partition.nonEmpty) {
+            Some(partition.map(kv => (kv._1, kv._2.get)))
--- End diff --

do we need to consider dynamic partition here?





[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15673
  
**[Test build #67949 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67949/consoleFull)**
 for PR 15673 at commit 
[`8d468ac`](https://github.com/apache/spark/commit/8d468ac7097de56989deee124ce65a6583f8eaa8).





[GitHub] spark issue #15673: [SPARK-17992][SQL] Return all partitions from HiveShim w...

2016-11-01 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15673
  
Rebased.





[GitHub] spark issue #15692: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate'...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15692
  
**[Test build #67948 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67948/consoleFull)**
 for PR 15692 at commit 
[`9e97eef`](https://github.com/apache/spark/commit/9e97eefcdf0950e81cc68a6a19f232f00042f473).





[GitHub] spark issue #15692: [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate'...

2016-11-01 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/15692
  
that's ok. I have deleted the params.





[GitHub] spark issue #15677: [SPARK-17963][SQL][Documentation] Add examples (extend) ...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15677
  
**[Test build #67946 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67946/consoleFull)**
 for PR 15677 at commit 
[`a6b50eb`](https://github.com/apache/spark/commit/a6b50ebafb01edceca1fc8a729177cdb87da5e20).





[GitHub] spark issue #15671: [SPARK-18206][ML]Add instrumentation logs to ML training...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15671
  
**[Test build #67947 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67947/consoleFull)**
 for PR 15671 at commit 
[`6d2d13f`](https://github.com/apache/spark/commit/6d2d13f79ae68e71d023e7c79d19586842d49c75).





[GitHub] spark issue #14937: [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KMeans i...

2016-11-01 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/14937
  
A small update: I have run a few tests on a refactored version of this 
patch which avoids some data copying. I have found at least one case where the 
current patch is faster, but many where it is not. I'll try to post formal 
results at some point. (All test cases use dense data, btw.)

In the meantime, I think it would be helpful to have more detail about the 
tests above. They are rather small datasets. How many centers were used? How 
were the timings observed? Thanks!





[GitHub] spark issue #15725: [SPARK-18167] Print out spark confs, and hive confs when...

2016-11-01 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/15725
  
Oh cool, it flaked on that run.

jenkins retest this please. // let's try to get a successful run for 
comparison





[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files & spark.jars shoul...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15669
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15702: [SPARK-18124] Observed delay based Event Time Watermarks

2016-11-01 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/15702
  
Given the concerns Ofir raised about a single far-future event screwing up 
monotonic event time, do you want to document that problem even if there isn't 
an enforced filter for it?





[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files & spark.jars shoul...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15669
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67941/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files & spark.jars shoul...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15669
  
**[Test build #67941 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67941/consoleFull)**
 for PR 15669 at commit 
[`0dd5486`](https://github.com/apache/spark/commit/0dd5486f266eff8a76d23236916b0ea458e75de1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-01 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/15702#discussion_r86066774
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -536,6 +535,41 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * :: Experimental ::
+   * Defines an event time watermark for this [[Dataset]]. A watermark tracks a point in time
+   * before which we assume no more late data is going to arrive.
+   *
+   * Spark will use this watermark for several purposes:
+   *  - To know when a given time window aggregation can be finalized and thus can be emitted when
+   *    using output modes that do not allow updates.
+   *  - To minimize the amount of state that we need to keep for on-going aggregations.
+   *
+   *  The current event time is computed by looking at the `MAX(eventTime)` seen in an epoch across
+   *  all of the partitions in the query minus a user specified `delayThreshold`.  Due to the cost
+   *  of coordinating this value across partitions, the actual watermark used is only guaranteed
+   *  to be at least `delayThreshold` behind the actual event time.  In some cases we may still
+   *  process records that arrive more than `delayThreshold` late.
+   *
+   * @param eventTime the name of the column that contains the event time of the row.
+   * @param delayThreshold the minimum delay to wait to data to arrive late, relative to the latest
+   *                       record that has been processed in the form of an interval
+   *                       (e.g. "1 minute" or "5 hours").
--- End diff --

Should this make it clear what the minimum useful granularity is (ms)?
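
For context, a usage sketch — I'm assuming the new API surfaces as `Dataset.withWatermark(eventTime, delayThreshold)` per this PR, and the source/column names below are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_timestamp, window}

val spark = SparkSession.builder().appName("watermark-demo").getOrCreate()

// Hypothetical streaming source; names are made up for illustration.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .select(col("value").as("word"), current_timestamp().as("eventTime"))

val windowedCounts = events
  .withWatermark("eventTime", "10 minutes")  // delayThreshold as an interval string
  .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
  .count()
```

If sub-second thresholds like "500 milliseconds" are accepted, the `@param` doc should say so; if ms is the smallest useful unit, it should say that instead.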


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-01 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/15702#discussion_r86067616
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StatefulAggregate.scala ---
@@ -104,85 +110,105 @@ case class StateStoreSaveExec(
 
   override protected def doExecute(): RDD[InternalRow] = {
     metrics // force lazy init at driver
-    assert(returnAllStates.nonEmpty,
-      "Incorrect planning in IncrementalExecution, returnAllStates have not been set")
-    val saveAndReturnFunc = if (returnAllStates.get) saveAndReturnAll _ else saveAndReturnUpdated _
+    assert(outputMode.nonEmpty,
+      "Incorrect planning in IncrementalExecution, outputMode has not been set")
+
     child.execute().mapPartitionsWithStateStore(
       getStateId.checkpointLocation,
       operatorId = getStateId.operatorId,
       storeVersion = getStateId.batchId,
       keyExpressions.toStructType,
       child.output.toStructType,
       sqlContext.sessionState,
-      Some(sqlContext.streams.stateStoreCoordinator)
-    )(saveAndReturnFunc)
+      Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) =>
+        val getKey = GenerateUnsafeProjection.generate(keyExpressions, child.output)
+        val numOutputRows = longMetric("numOutputRows")
+        val numTotalStateRows = longMetric("numTotalStateRows")
+        val numUpdatedStateRows = longMetric("numUpdatedStateRows")
+
+        outputMode match {
+          // Update and output all rows in the StateStore.
+          case Some(Complete) =>
+            while (iter.hasNext) {
+              val row = iter.next().asInstanceOf[UnsafeRow]
+              val key = getKey(row)
+              store.put(key.copy(), row.copy())
+              numUpdatedStateRows += 1
+            }
+            store.commit()
+            numTotalStateRows += store.numKeys()
+            store.iterator().map { case (k, v) =>
+              numOutputRows += 1
+              v.asInstanceOf[InternalRow]
+            }
+
+          // Update and output only rows being evicted from the StateStore
+          case Some(Append) =>
+            while (iter.hasNext) {
+              val row = iter.next().asInstanceOf[UnsafeRow]
+              val key = getKey(row)
+              store.put(key.copy(), row.copy())
+              numUpdatedStateRows += 1
+            }
+
+            val watermarkAttribute =
+              keyExpressions.find(_.metadata.contains(EventTimeWatermark.delayKey)).get
+            // If we are evicting based on a window, use the end of the window.  Otherwise just
+            // use the attribute itself.
+            val evictionExpression =
+              if (watermarkAttribute.dataType.isInstanceOf[StructType]) {
+                LessThanOrEqual(
+                  GetStructField(watermarkAttribute, 1),
+                  Literal(eventTimeWatermark.get * 1000))
+              } else {
+                LessThanOrEqual(
+                  watermarkAttribute,
+                  Literal(eventTimeWatermark.get * 1000))
+              }
+
+            logInfo(s"Filtering state store on: $evictionExpression")
+            val predicate = newPredicate(evictionExpression, keyExpressions)
+            store.remove(predicate)
+
+            store.commit()
+
+            numTotalStateRows += store.numKeys()
+            store.updates().filter(_.isInstanceOf[ValueRemoved]).map { removed =>
+              numOutputRows += 1
+              removed.value.asInstanceOf[InternalRow]
+            }
+
+          // Update and output modified rows from the StateStore.
+          case Some(Update) =>
--- End diff --

I'm not clear on why the semantics of Update mean that watermarks shouldn't 
be used to remove state.
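
Restating the Append-branch eviction rule above in isolation might help the discussion (a sketch only; `WindowKey` and `shouldEvict` are made-up names, and I'm reading the `* 1000` as a ms-to-µs conversion):

```scala
// Sketch of the eviction predicate built in the Append branch above.
// State timestamps are in microseconds; the watermark arrives in milliseconds.
case class WindowKey(startUs: Long, endUs: Long)

def shouldEvict(key: WindowKey, watermarkMs: Long): Boolean =
  key.endUs <= watermarkMs * 1000  // LessThanOrEqual(GetStructField(attr, 1), Literal(wm * 1000))
```

If Update mode evicted state under the same rule, its store would stay bounded too, which is why the asymmetry surprises me.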


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-01 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/15702#discussion_r86066376
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -536,6 +535,41 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * :: Experimental ::
+   * Defines an event time watermark for this [[Dataset]]. A watermark tracks a point in time
+   * before which we assume no more late data is going to arrive.
+   *
+   * Spark will use this watermark for several purposes:
+   *  - To know when a given time window aggregation can be finalized and thus can be emitted when
+   *    using output modes that do not allow updates.
+   *  - To minimize the amount of state that we need to keep for on-going aggregations.
--- End diff --

For Append, this sounds like the intention is to emit a result only once the watermark has passed, and then drop its state. But for the other output modes, it's not clear from reading this what effect the watermark has on emission and on dropping state.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15702: [SPARK-18124] Observed delay based Event Time Wat...

2016-11-01 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/15702#discussion_r86066082
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -536,6 +535,41 @@ class Dataset[T] private[sql](
   }
 
   /**
+   * :: Experimental ::
+   * Defines an event time watermark for this [[Dataset]]. A watermark tracks a point in time
+   * before which we assume no more late data is going to arrive.
+   *
+   * Spark will use this watermark for several purposes:
+   *  - To know when a given time window aggregation can be finalized and thus can be emitted when
+   *    using output modes that do not allow updates.
+   *  - To minimize the amount of state that we need to keep for on-going aggregations.
+   *
+   *  The current event time is computed by looking at the `MAX(eventTime)` seen in an epoch across
--- End diff --

- Should this be "The current watermark is computed..."?
- What is an epoch? It isn't mentioned in the docs or elsewhere in this PR.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15675: [SPARK-18144][SQL] logging StreamingQueryListener$QueryS...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15675
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15675: [SPARK-18144][SQL] logging StreamingQueryListener$QueryS...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15675
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67942/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


