[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95313666 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -219,6 +219,9 @@ class SQLBuilder private ( case OneRowRelation => "" +case p: View => --- End diff -- Maybe add a TODO comment above this line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16344 Can one of the admins verify this patch?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344 [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson and Gamma families are special cases of the Tweedie (https://en.wikipedia.org/wiki/Tweedie_distribution). @yanboliang @srowen @sethah I propose to add support for the other distributions: - compound Poisson: 1 < varPower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc. - positive stable: varPower > 2 and varPower != 3. Used to model extreme values. - inverse Gaussian: varPower = 3. The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. Changes made: - Allow `tweedie` in family. Only `identity` and `log` links are allowed for now. - Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and (2, infty). Also set the default value to 1.5 and add a getter method. - Add `Tweedie` class - Add tests for tweedie GLM Note: - In computing deviance, use `math.max(y, 0.1)` to avoid taking the inverse of 0. This is the same as in R: `tweedie()$dev.res` - `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases is non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`). 
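For reference, the unit deviance behind the `math.max(y, 0.1)` note above can be sketched in a few lines of Python. This is a hedged illustration of the standard Tweedie deviance formula for a variance power p not equal to 1 or 2, not the PR's actual Scala code:

```python
def tweedie_unit_deviance(y, mu, p):
    """Unit deviance of the Tweedie family for variance power p != 1, 2.

    y is clamped below at 0.1, mirroring the PR's note about avoiding
    the inverse of zero (the same guard as R's tweedie()$dev.res).
    """
    y = max(y, 0.1)  # guard: negative powers of y blow up at y == 0
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )
```

At y == mu the deviance is zero, and it grows as the fitted mean moves away from the observation, which is what makes it usable as a goodness-of-fit measure across the whole family.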
You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark tweedie Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16344 commit 952887e485fb0d5fa669b3b4c9289b8069ee7769 Author: actuaryzhang Date: 2016-12-16T00:50:51Z Add Tweedie family to GLM commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3 Author: actuaryzhang Date: 2016-12-19T22:50:02Z Fix calculation in dev resid; Add test for different var power commit 7fe39106332663d3671b94a8ffac48ca61c48470 Author: actuaryzhang Date: 2016-12-19T23:14:37Z Merge test into GLR commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa Author: actuaryzhang Date: 2016-12-20T22:59:05Z Use Tweedie class instead of global object Tweedie; change variancePower to varPower commit a8feea7d8095170c1b5f18b7887f16a6d763e42c Author: actuaryzhang Date: 2016-12-21T23:42:40Z Allow Family to use GLRBase object directly commit 233e2d338be8d36a74eaf578bfea804ae3617d4e Author: actuaryzhang Date: 2016-12-22T01:56:34Z Add TweedieFamily and implement specific distn within Tweedie commit 17c55816c914bc96a8b5141356e3c117f343f303 Author: actuaryzhang Date: 2016-12-22T04:39:54Z Clean up doc commit 0b41825e99020976a34d8fe9c983f26de6c8c40f Author: actuaryzhang Date: 2016-12-22T17:52:01Z Move defaultLink and name to subclass of TweedieFamily commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac Author: actuaryzhang Date: 2016-12-22T18:10:51Z Change style for AIC commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e Author: actuaryzhang Date: 2016-12-23T19:10:20Z Rename Family methods and restore methods for tweedie subclasses commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e Author: actuaryzhang Date: 2016-12-23T19:12:25Z Update test commit 9a71e89f629260c775922901a04c989f36ea4946 Author: actuaryzhang Date: 
2016-12-27T17:16:40Z Clean up doc commit f461c09e65360f695ad3092b41bc26e0c61bbd95 Author: actuaryzhang Date: 2016-12-27T22:18:39Z Put delta in Tweedie companion object commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4 Author: actuaryzhang Date: 2016-12-27T22:23:57Z Clean up doc commit fab265278109eede4cce7ee506e8b29d481c4549 Author: actuaryzhang Date: 2017-01-05T19:32:06Z Allow more link functions in tweedie
[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15480 **[Test build #71115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71115/testReport)** for PR 15480 at commit [`4aef473`](https://github.com/apache/spark/commit/4aef473b70ef20d7a93fd974a3d844e49ca6ef9e).
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Merged build finished. Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71112/ Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71112/testReport)** for PR 16487 at commit [`27d97e5`](https://github.com/apache/spark/commit/27d97e56295465bb717da9ff08d2571998adf2b1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71114 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71114/testReport)** for PR 12064 at commit [`083e4f9`](https://github.com/apache/spark/commit/083e4f932ee9bd65d211bcb8506d53bb35c160f6).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95310611 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply --- End diff -- : ) I did not see your comment.
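The attribute-matching step performed by the `AliasViewChild` rule quoted above can be sketched outside Spark. Below is a hypothetical Python analogue of `findAttributeByName`, with a simple case-insensitive function standing in for `conf.resolver`:

```python
def find_attribute_by_name(name, attrs, resolver=None):
    """Return the attribute in `attrs` whose name matches `name` under
    `resolver`; raise if none matches (like the rule's AnalysisException).

    `attrs` is modeled here as a list of attribute-name strings; the
    default resolver is a case-insensitive stand-in for conf.resolver.
    """
    if resolver is None:
        resolver = lambda a, b: a.lower() == b.lower()
    for attr in attrs:
        if resolver(attr, name):
            return attr
    raise ValueError(
        "Attribute with name '%s' is not found in (%s)" % (name, ",".join(attrs))
    )
```

The rule then wraps each matched attribute in an alias carrying the view output's expression id, so downstream operators keep referring to the view's declared columns.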
[GitHub] spark issue #16476: [SPARK-19084][SQL] Implement expression field
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/16476 @chenghao-intel I think that the optimize rule will fold the parameters that have different types with param0, and then disorganize the parameters' index. Thanks. @rxin I have removed the [WIP] mark, this PR is ready now, could you please help me review it? Thanks.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95310486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Ah i see - let me update this :-)
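Conceptually, the `splitExpressions`/`foldFunctions` change above moves from one giant generated comparison chain (which can exceed the JVM's 64KB method limit) to many small compare functions whose results are folded with early returns. A rough Python model of that fold, with hypothetical names, not the generated Java:

```python
def fold_comparisons(a, b, compare_fns):
    """Apply each split compare function in order; the first non-zero
    result decides the ordering, mirroring the generated early returns."""
    for compare in compare_fns:
        result = compare(a, b)
        if result != 0:
            return result
    return 0  # all sort keys compared equal

# One small compare function per sort key, as the codegen splits them:
by_first = lambda a, b: (a[0] > b[0]) - (a[0] < b[0])
by_second = lambda a, b: (a[1] > b[1]) - (a[1] < b[1])
```

Splitting this way keeps each generated method small while preserving the short-circuit semantics of the original single chain.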
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95310499 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply --- End diff -- Can we add an assert below to ensure the outputs are the same?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16487 LGTM
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16510 This is pretty useful!
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309414 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- The rule `ImplicitTypeCasts` is done in the batch of `Resolution`, right?
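The `checkDataType` helper under discussion rejects view children whose column types cannot be cast to the view's declared output types. A hypothetical, much-simplified Python version follows; the cast table here is purely illustrative and does not reproduce Spark's actual `Cast.canCast` rules:

```python
# Illustrative up-cast pairs; Spark's real rules live in Cast.canCast.
CASTABLE = {
    ("int", "long"), ("int", "double"), ("long", "double"),
    ("float", "double"), ("int", "string"), ("double", "string"),
}

def check_data_type(name, child_type, view_type):
    """Raise when the child attribute's type can't be cast to the view
    output's declared type, like the rule's AnalysisException."""
    if child_type != view_type and (child_type, view_type) not in CASTABLE:
        raise ValueError(
            "The dataType of attribute '%s' is '%s', which can't be "
            "casted to '%s'." % (name, child_type, view_type)
        )
```

Performing this check eagerly at analysis time surfaces an incompatible schema change in an underlying table as a clear error, rather than a confusing failure deep in execution.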
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16517 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/7/ Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309984 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
--- End diff -- Nit: `a analyzed` -> `an analyzed`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above]
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { --- End diff -- Nit: `transformUp` -> `transform`. When the order does not matter, we use `transform`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309603 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above]
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply +// remove the View operator. +case View(_, output, child) => child --- End diff -- Nit: `case View(_, output, child) => child` -> `case View(_, _, child) => child`
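As a side note, the shape of such a wrapper-elimination rule can be sketched outside Spark with a toy plan type (the names below are hypothetical and simplified; real code works on Spark's `TreeNode`/`LogicalPlan` hierarchy):

```scala
// Minimal sketch of a wrapper-elimination rule over a toy plan tree.
// `View` carries name/output metadata only; removing it leaves the child,
// which by this point already produces the same output attributes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class View(name: String, child: Plan) extends Plan

def eliminateView(plan: Plan): Plan = plan match {
  case View(_, child) => eliminateView(child) // recurse: views may be nested
  case other          => other
}
```

Under these assumptions, `eliminateView(View("v1", View("v2", Scan("t"))))` reduces to `Scan("t")`.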
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309481 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ [diff context elided; duplicate of the view.scala diff quoted above] + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- Have you added any test case for type casting?
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16517 **[Test build #7 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)** for PR 16517 at commit [`36c9269`](https://github.com/apache/spark/commit/36c9269c45b4bdca3be36a4e73bf48f8e273a9a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16517 Merged build finished. Test PASSed.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95309134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- oh sorry, we should put the `freshName` in the map function ``` funCalls.zipWithIndex.map { case (funCall, i) => val comp = ctx.freshName("comp") s""" int $comp = $funCall; ... } ```
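The point of the fix can be illustrated with a self-contained sketch (names here are hypothetical; `Ctx.freshName` stands in for Spark's `CodegenContext.freshName`): generating the name inside the map closure yields a distinct identifier per split-function call, so the concatenated snippets never declare the same variable twice.

```scala
// Stand-in for CodegenContext.freshName: appends a unique numeric suffix.
object Ctx {
  private var counter = -1
  def freshName(prefix: String): String = { counter += 1; s"${prefix}_$counter" }
}

// One snippet per generated split-function call, as in foldFunctions.
def foldFunctions(funCalls: Seq[String]): String =
  funCalls.map { funCall =>
    val comp = Ctx.freshName("comp") // fresh per call, inside the closure
    s"int $comp = $funCall; if ($comp != 0) return $comp;"
  }.mkString("\n")

val folded = foldFunctions(Seq("compare_0(a, b)", "compare_1(a, b)"))
// folded declares comp_0 and comp_1 — not one name declared twice.
```

Hoisting `freshName` out of the loop (as in the original diff) would reuse one `comp` name in every snippet, producing exactly the duplicate declarations lw-lin's generated-code excerpt below shows.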
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95308910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above] + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. --- End diff -- `stable` -> `completely resolved`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95308787 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above] + * name. This should be only done after the resolution batch, because the view attributes are --- End diff -- Nit: `after the resolution batch` -> `after the batch of Resolution`
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95308691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- Let's make the param sequence to: `child, percentageExpression, frequencyExpression`, and give a default value to `frequencyExpression`.
[GitHub] spark issue #16522: [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalC...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16522 Hi, @vanzin and @zsxwing. The PR passes the tests. Could you review this PR again?
[GitHub] spark issue #16440: [SPARK-18857][SQL] Don't use `Iterator.duplicate` for `i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16440 Hi, @srowen. Could you merge this PR?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16487 LGTM pending test
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95307494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. 
Else use the currentDb of the SessionCatalog. --- End diff -- ``` 1. u.tableIdentifier.database, if defined 2. defaultDatabase, if defined 3. currentDb of SessionCatalog, otherwise. ```
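The precedence gatorsmile spells out is essentially an `orElse` chain. A hedged sketch (the helper and parameter names below are hypothetical, not Spark's actual API):

```scala
// The database used to resolve a relation: the table identifier's own
// database if defined, else the view's default database if defined, else
// the session's current database.
def resolveDatabase(
    identifierDb: Option[String],
    viewDefaultDb: Option[String],
    currentDb: String): String =
  identifierDb.orElse(viewDefaultDb).getOrElse(currentDb)
```

For example, resolving a bare table name inside a view created under `db1` while the session is in `db2` would use `db1`, matching step 2 of the lookup rules.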
[GitHub] spark issue #16523: [SPARK-19142][SparkR]:spark.kmeans should take seed, ini...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/16523 cc @yanboliang
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307385 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s).Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, --- End diff -- I think `frequencyExpression` would be the correct name, but I have sequenced the parameters as they would appear in the SQL: `select percentile(col, frq, percentage) from table`, where `frq` is optional.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307232 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s).Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, percentageExpression: Expression, +withFrqExpr : Boolean, --- End diff -- Please see my comment below why we need to make a distinction either using flag or Option I am inclined towards using a flag because switching to option would change the code in update from val frqValue = frequency.eval(input) to val frqValue = frequency.getOrElse( unit).eval(input) But i think Option[Expression] would be better logically Once we have an agreement if we need to have a distinction or not i will make the changes accordingly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
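The flag-vs-Option trade-off discussed above can be illustrated with a minimal standalone sketch (these are simplified stand-ins, not the real Catalyst `Expression` or `Percentile` classes): with `Option[Expression]`, `children` stays logically correct without a separate boolean flag.

```scala
// Simplified stand-ins for Catalyst expressions (hypothetical, for illustration only).
sealed trait Expr { def dataType: String }
case class Col(name: String, dataType: String) extends Expr
case class Lit(value: Any, dataType: String) extends Expr

// Option-based variant: the optional frequency child simply drops out of
// `children` when absent, so no `withFrqExpr` flag is needed.
case class Percentile(child: Expr, frequency: Option[Expr], percentage: Expr) {
  def children: Seq[Expr] = child +: (frequency.toSeq :+ percentage)
  def inputTypes: Seq[String] = children.map(_.dataType)
}

val withFrq = Percentile(Col("age", "int"), Some(Col("count", "int")), Lit(0.5, "double"))
val withoutFrq = Percentile(Col("age", "int"), None, Lit(0.5, "double"))
```

Both `children` and `inputTypes` shrink consistently when the frequency child is absent, which is the "logically correct" behavior argued for in the thread.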
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Thanks for clarifying on this. But we'll get something like the following (suppose we get a fresh name `comp_1`): ```java /* 15388 */ public int compare(InternalRow a, InternalRow b) { /* 15389 */ /* 15390 */ int comp_1 = compare_0(a, b); // comp_1 /* 15391 */ if (comp_1 != 0) { /* 15392 */ return comp_1; /* 15393 */ } /* 15394 */ /* 15395 */ int comp_1 = compare_1(a, b); // still comp_1 /* 15396 */ if (comp_1 != 0) { /* 15397 */ return comp_1; /* 15398 */ } /* 1 */ /* 1 */ ... /* 1 */ /* 15450 */ return 0; /* 15451 */ } ```
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95306656 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -469,9 +475,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def table(tableName: String): DataFrame = { -Dataset.ofRows(sparkSession, - sparkSession.sessionState.catalog.lookupRelation( - sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))) +sparkSession.table(tableName) --- End diff -- + 1
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- I mean ``` val comp = ctx.freshName("comp") funCalls.zipWithIndex.map { case (funCall, i) => s""" int $comp = $funCall; ... } ``` just remove all the `_$i` here
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95306192 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -125,11 +132,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log // Otherwise, wrap the table with a Subquery using the table name. alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { + val tableIdentifier = table.identifier val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) - SubqueryAlias( -alias.getOrElse(table.identifier.table), -sparkSession.sessionState.sqlParser.parsePlan(viewText), -Option(table.identifier)) + // The relation is a view, so we wrap the relation by: + // 1. Add a [[View]] operator over the relation to keep track of the view desc; + // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. + val child = View( +desc = table, +output = table.schema.toAttributes, +child = sparkSession.sessionState.sqlParser.parsePlan(viewText)) --- End diff -- We can even add `Cast` in project list so that analyzer can report data type mismatch for us automatically.
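The suggestion above — adding `Cast` in the project list so the analyzer reports type mismatches automatically — can be illustrated with a small standalone sketch (the `Attr`, `castable`, and `castTo` names are hypothetical stand-ins, not Catalyst's real `Cast`/`canCast`):

```scala
// Hypothetical stand-in for an attribute with a data type.
case class Attr(name: String, dataType: String)

// A toy castability table playing the role of Cast.canCast.
val castable: Set[(String, String)] =
  Set(("int", "int"), ("int", "long"), ("int", "string"), ("string", "string"))

// Wrap a column in a cast to the view's declared type; an impossible cast
// fails at "analysis" time instead of producing wrong results later.
def castTo(attr: Attr, target: String): Attr =
  if (castable((attr.dataType, target))) attr.copy(dataType = target)
  else throw new IllegalArgumentException(
    s"cannot cast ${attr.name}: ${attr.dataType} to $target")

val projected = Seq(Attr("id", "int")).map(castTo(_, "long"))
```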
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306125 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Ah, I misunderstood it. You meant moving `ctx.freshName("comp")` into the `funCalls.zipWithIndex.map {...}`, right? Like the following: ```scala // val comp = ctx.freshName("comp") // moved this into the map {...} funCalls.zipWithIndex.map { case (funCall, i) => s""" int ${ctx.freshName("comp")} = $funCall; ... } ```
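For context on the naming discussion above, here is a minimal standalone sketch of a fresh-name generator in the spirit of `CodegenContext.freshName` (the `NameGenerator` class and the `base_id` output format are assumptions for illustration, not Spark's actual implementation): calling it once per fold step, as proposed, yields a distinct name per split-function call and avoids the duplicate `int comp_1` declarations shown in the generated code.

```scala
import scala.collection.mutable

// Hypothetical fresh-name generator: each call for the same base name
// appends an incrementing id, guaranteeing uniqueness within one context.
class NameGenerator {
  private val counters = mutable.Map.empty[String, Int].withDefaultValue(0)
  def freshName(base: String): String = {
    val id = counters(base)
    counters(base) = id + 1
    s"${base}_$id"
  }
}

val ctx = new NameGenerator
// One freshName call per fold step, as suggested in the review thread.
val names = (0 until 3).map(_ => ctx.freshName("comp"))
```

Calling `freshName` once outside the `map` and reusing the result, by contrast, would emit the same variable name in every generated block, which is exactly the collision the review comment illustrates.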
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95306084 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -125,11 +132,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log // Otherwise, wrap the table with a Subquery using the table name. alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { + val tableIdentifier = table.identifier val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) - SubqueryAlias( -alias.getOrElse(table.identifier.table), -sparkSession.sessionState.sqlParser.parsePlan(viewText), -Option(table.identifier)) + // The relation is a view, so we wrap the relation by: + // 1. Add a [[View]] operator over the relation to keep track of the view desc; + // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. + val child = View( +desc = table, +output = table.schema.toAttributes, +child = sparkSession.sessionState.sqlParser.parsePlan(viewText)) --- End diff -- Sorry, I may have asked this question before, but I can't recall your answer. Why do we need the `output` field of `View`? Can we do something like ``` val child = sparkSession.sessionState.sqlParser.parsePlan(viewText) val projectList = schema.map { field => Alias(UnresolvedAttribute(Seq(field.name)), field.name)(explicitMetadata = Some(field.metadata)) } View( desc = table, child = Project(projectList, child)) case class View(...) { def output = child.output } ```
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71113/ Test FAILed.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Merged build finished. Test FAILed.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95305820 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- I have given a lot of thought to whether we need to make a distinction here. Let's take a data set of age_string, age, count: "20", 20, 1 "15", 15, 1 "10", 10, 1 For the SQL "select percentile( age, count , 0.5 ) from table", the logically correct values should be children = age :: count :: 0.5 :: Nil and inputType = IntegerType :: IntegerType :: DoubleType :: Nil. For the SQL "select percentile( age, 0.5 ) from table", the logically correct values should be children = age :: 0.5 :: Nil and inputType = IntegerType :: DoubleType :: Nil. Here is one example where keeping it logically correct would help. For the following incorrect SQL "select percentile( age, '10') from table", with children = age :: '10' :: Nil and inputType = IntegerType :: StringType :: Nil, since both children and inputType are used for dataType validation, the error message would be correct, as below: "argument 2 requires Double type, however, 10 is of String type." However, with children = age :: Literal(1) :: '10' :: Nil and inputType = IntegerType :: IntegerType :: StringType :: Nil, the error message would NOT be correct and would be confusing, as below: "argument 3 requires Double type, however, 10 is of String type." Since both children and dataType are public methods, I was inclined to keep them explicitly correct, and therefore I decided to make a distinction. Please let me know your thoughts.
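The error-message argument above can be reproduced with a small sketch of ordinal-based input-type checking (the `checkInputTypes` helper is a simplified hypothetical stand-in for Catalyst's input-type validation, not the real API): the reported argument number comes from the position in `children`, so inserting a synthetic `Literal(1)` frequency child shifts the ordinal in the message.

```scala
// Simplified ordinal-based type check: report the first mismatching child,
// using its 1-based position in the children list (hypothetical stand-in).
def checkInputTypes(actual: Seq[String], expected: Seq[String]): Option[String] =
  actual.zip(expected).zipWithIndex.collectFirst {
    case ((a, e), i) if a != e =>
      s"argument ${i + 1} requires $e type, however, it is of $a type."
  }

// percentile(age, '10'): without a placeholder, the bad percentage is argument 2.
val noPlaceholder = checkInputTypes(Seq("Integer", "String"), Seq("Integer", "Double"))
// With an implicit Literal(1) frequency inserted, it becomes argument 3.
val withPlaceholder =
  checkInputTypes(Seq("Integer", "Integer", "String"), Seq("Integer", "Integer", "Double"))
```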
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71113/testReport)** for PR 12064 at commit [`8c2d529`](https://github.com/apache/spark/commit/8c2d5291affd75081da8152cfb6a1afc7e44a656). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305635 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: --- End diff -- Nit: `follow` -> `following`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305370 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -174,6 +176,7 @@ case class CatalogTable( stats: Option[CatalogStatistics] = None, viewOriginalText: Option[String] = None, viewText: Option[String] = None, +viewDefaultDatabase: Option[String] = None, --- End diff -- Can we make this a table property instead of a field? Something like ``` case class CatalogTable(...) { def viewDefaultDatabase: Option[String] = properties.get(VIEW_DEFAULT_DATABASE) } object CatalogTable { val VIEW_DEFAULT_DATABASE = "view.default.database" } ``` This is only used in a few places, and I don't think it's worth adding so many hacks in `HiveExternalCatalog`.
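A standalone sketch of the property-backed accessor proposed above (a simplified `CatalogTable` carrying only a `properties` map; the real class has many more fields):

```scala
// Companion holding the property key, mirroring the suggestion above.
object CatalogTable {
  val VIEW_DEFAULT_DATABASE = "view.default.database"
}

// Simplified stand-in: the view's default database is derived from the
// properties map instead of being a dedicated constructor field.
case class CatalogTable(properties: Map[String, String] = Map.empty) {
  def viewDefaultDatabase: Option[String] =
    properties.get(CatalogTable.VIEW_DEFAULT_DATABASE)
}

val view = CatalogTable(Map(CatalogTable.VIEW_DEFAULT_DATABASE -> "db1"))
val table = CatalogTable()
```

The accessor degrades gracefully to `None` for tables that never set the property, which matches the backward-compatibility note elsewhere in this thread about views created by older versions of Spark.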
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305390 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} --- End diff -- Like [withWatermark](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L572), we can simplify the above code by changing the interface of `withAnalysisContext`: ```Scala val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) { execute(child) } ``` The `nestedViewLevel` can be incremented by one in `withAnalysisContext`.
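The simplified `withAnalysisContext` interface suggested above could look roughly like this sketch (a minimal `AnalysisContext` with a thread-local holder; the names mirror the PR, but the implementation details are assumptions): the helper takes only the default database, increments `nestedViewLevel` itself, and restores the previous context afterwards.

```scala
// Minimal per-thread analysis context (sketch, not Spark's actual code).
case class AnalysisContext(defaultDatabase: Option[String] = None, nestedViewLevel: Int = 0)

object AnalysisContext {
  private val value = new ThreadLocal[AnalysisContext] {
    override def initialValue(): AnalysisContext = AnalysisContext()
  }
  def get: AnalysisContext = value.get()

  // Swap in a child context (incrementing the nesting level here, so callers
  // don't have to), run the block, then restore the previous context.
  def withAnalysisContext[A](database: Option[String])(body: => A): A = {
    val original = get
    value.set(AnalysisContext(database, original.nestedViewLevel + 1))
    try body finally value.set(original)
  }
}

val levelInside = AnalysisContext.withAnalysisContext(Some("db2")) {
  AnalysisContext.get.nestedViewLevel
}
```

The `try/finally` restore makes nested calls compose: each level of view resolution sees its own default database and depth, and the outer context reappears when the block exits, even on exceptions.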
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71113/testReport)** for PR 12064 at commit [`8c2d529`](https://github.com/apache/spark/commit/8c2d5291affd75081da8152cfb6a1afc7e44a656).
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95305125 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala --- @@ -592,47 +579,6 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg assert(manager.resourceOffer("execB", "host2", RACK_LOCAL).get.index === 1) } - test("do not emit warning when serialized task is small") { --- End diff -- Ok, I will add this test case back.
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/15505 @squito In local mode, performance is relatively less important; we only guarantee that there will be no performance degradation.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304793 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- how did we handle view schema mismatch before?
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304745 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. 
Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDatabase(defaultDatabase)
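The three-step database choice described in the quoted comment can be sketched in isolation. This is a simplified model, not Spark's real API: `TableIdentifier`, `withDatabase`, `LookupSketch`, and `currentDb` below are hypothetical stand-ins for the catalyst classes the diff references.

```scala
// Sketch of the lookup-database precedence, under the assumption that
// `withDatabase` only fills in a missing database part (explicit wins).
case class TableIdentifier(table: String, database: Option[String] = None) {
  def withDatabase(db: Option[String]): TableIdentifier =
    if (database.isDefined) this else copy(database = db)
}

object LookupSketch {
  val currentDb = "default"  // stand-in for SessionCatalog's current database

  // 1. explicit database part; 2. else the view's defaultDatabase; 3. else currentDb.
  def resolveDb(ident: TableIdentifier, defaultDatabase: Option[String]): String =
    ident.database.orElse(defaultDatabase).getOrElse(currentDb)
}
```

With this precedence, a nested view carries its own `defaultDatabase` down to the relations it references, while an explicitly qualified table name is never overridden.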
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley @srowen Can anyone check my latest commit please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304509
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala ---
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The attributes are resolved by
+ * name. This should only be done after the resolution batch, because the view attributes are
+ * not stable during resolution.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case v @ View(_, output, child) if child.resolved =>
+      val resolver = conf.resolver
+      val newOutput = child.output.map { attr =>
+        val newAttr = findAttributeByName(attr.name, output, resolver)
+        // Check the dataType of the output attributes, throw an AnalysisException if they don't
+        // match up.
+        checkDataType(attr, newAttr)
+        Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier,
+          explicitMetadata = Some(newAttr.metadata))
+      }
+      v.copy(child = Project(newOutput, child))
+  }
+
+  /**
+   * Find the attribute that has the expected attribute name from an attribute list; the names
+   * are compared using conf.resolver.
+   * If the expected attribute is not found, throw an AnalysisException.
+   */
+  private def findAttributeByName(
+      name: String,
+      attrs: Seq[Attribute],
+      resolver: Resolver): Attribute = {
+    attrs.collectFirst {
+      case attr if resolver(attr.name, name) => attr
+    }.getOrElse(throw new AnalysisException(
+      s"Attribute with name '$name' is not found in " +
+        s"'${attrs.map(_.name).mkString("(", ",", ")")}'"))
+  }
+
+  /**
+   * Check whether the dataType of `attr` could be cast to that of `other`, throw an
+   * AnalysisException if the two attributes don't match up.
+   */
+  private def checkDataType(attr: Attribute, other: Attribute): Unit = {
+    if (!Cast.canCast(attr.dataType, other.dataType)) {
+      throw new AnalysisException(
+        s"The dataType of attribute '$other' is '${other.dataType}', which can't be cast to " +
+          s"that of '$attr', expected '${attr.dataType}'.")
+    }
+  }
+}
+
+/**
+ * Removes [[View]] operators from the plan. The operator is respected till the end of the
+ * analysis stage because we want to see which part of an analyzed logical plan is generated
+ * from a view.
+ */
+object EliminateView extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // The child should have the same output attributes with the View operator, so we simply
--- End diff --
> The child should have the same output attributes with the View operator

shall we add an assert?
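The name-based matching that `AliasViewChild` performs can be illustrated with a stripped-down model. `Attr` and `AliasSketch` below are hypothetical stand-ins: the real rule also carries exprIds, qualifiers, and metadata, and throws `AnalysisException` rather than `IllegalArgumentException`.

```scala
// Stripped-down model of AliasViewChild's matching step: pair each child output
// attribute with the view output attribute of the same name, using a
// case-insensitive resolver (Spark's default).
case class Attr(name: String, dataType: String)

object AliasSketch {
  val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

  def findAttributeByName(name: String, attrs: Seq[Attr]): Attr =
    attrs.collectFirst { case a if resolver(a.name, name) => a }.getOrElse(
      throw new IllegalArgumentException(s"Attribute with name '$name' is not found"))

  // The pairs the real rule turns into Alias(childAttr, name)(viewAttr.exprId, ...).
  def aliasPairs(viewOutput: Seq[Attr], childOutput: Seq[Attr]): Seq[(Attr, Attr)] =
    childOutput.map(attr => attr -> findAttributeByName(attr.name, viewOutput))
}
```

Resolving by name rather than by position is what keeps a view stable when the underlying table's column order changes.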
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304037
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
+        val defaultDatabase = AnalysisContext.get.defaultDatabase
+        val relation = lookupTableFromCatalog(u, defaultDatabase)
+        resolveRelation(relation, defaultDatabase)
+      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
+      // `viewText` should be defined, or else we throw an error on the generation of the View
+      // operator.
+      case view @ View(desc, _, child) if !child.resolved =>
+        val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
+        val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
+          nestedViewLevel = nestedViewLevel)
+        // Resolve all the UnresolvedRelations and Views in the child.
+        val newChild = AnalysisContext.withAnalysisContext(context) {
+          execute(child)
+        }
+        view.copy(child = newChild)
+      case p @ SubqueryAlias(_, view: View, _) =>
--- End diff --
why we have this case?
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95304052
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -149,7 +149,12 @@ private[spark] object Utils extends Logging {
   /** Deserialize an object using Java serialization and the given ClassLoader */
   def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
-    val bis = new ByteArrayInputStream(bytes)
+    deserialize(ByteBuffer.wrap(bytes), loader)
+  }
+
+  /** Deserialize an object using Java serialization and the given ClassLoader */
+  def deserialize[T](bytes: ByteBuffer, loader: ClassLoader): T = {
+    val bis = new ByteBufferInputStream(bytes)
--- End diff --
I don't think there's a problem here. It is covered by many test cases.
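The point of the new overload is that a `ByteBuffer` can be deserialized without first copying it into a byte array. Below is a JDK-only sketch of the same idea: the anonymous `InputStream` plays the role of Spark's `ByteBufferInputStream`, and the `ClassLoader` parameter that the real `Utils.deserialize` threads through is omitted for brevity.

```scala
import java.io.{ByteArrayOutputStream, InputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object SerSketch {
  // Wrap a ByteBuffer as an InputStream without copying its contents,
  // similar in spirit to Spark's ByteBufferInputStream.
  def stream(buf: ByteBuffer): InputStream = new InputStream {
    override def read(): Int = if (buf.hasRemaining) buf.get() & 0xFF else -1
  }

  def serialize(o: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o)
    oos.close()
    bos.toByteArray
  }

  // Java deserialization driven directly by the buffer's position.
  def deserialize[T](bytes: ByteBuffer): T = {
    val ois = new ObjectInputStream(stream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```

A round trip through `ByteBuffer.wrap` then behaves exactly like the byte-array overload, which is why delegating the old method to the new one is safe.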
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95303945
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
+        val defaultDatabase = AnalysisContext.get.defaultDatabase
+        val relation = lookupTableFromCatalog(u, defaultDatabase)
+        resolveRelation(relation, defaultDatabase)
+      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
+      // `viewText` should be defined, or else we throw an error on the generation of the View
+      // operator.
+      case view @ View(desc, _, child) if !child.resolved =>
+        val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
+        val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
+          nestedViewLevel = nestedViewLevel)
+        // Resolve all the UnresolvedRelations and Views in the child.
+        val newChild = AnalysisContext.withAnalysisContext(context) {
+          execute(child)
+        }
+        view.copy(child = newChild)
+      case p @ SubqueryAlias(_, view: View, _) =>
+        val newChild = resolveRelation(view, defaultDatabase)
+        p.copy(child = newChild)
+      case _ => plan
+    }
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+      case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
+        i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+      case u: UnresolvedRelation => resolveRelation(u)
+    }
+
+    // Look up the table with the given name from catalog. The database we look up the table
+    // from is decided by the following steps:
+    // 1. If the database part is defined in the table identifier, use that database name;
+    // 2. Else if the defaultDatabase is defined, use the default database name (in this case,
+    //    no temporary objects can be used, and the default database is only used to look up a
+    //    view);
+    // 3. Else use the currentDb of the SessionCatalog.
--- End diff --
Looks like the current code doesn't match these comments well, shall we update the comments?
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95303491 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala --- @@ -1169,26 +1169,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv } - test("save API - format hive") { --- End diff -- already covered by https://github.com/apache/spark/pull/16487/files#diff-b7094baa12601424a5d19cb930e3402fR1356
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71112/testReport)** for PR 16487 at commit [`27d97e5`](https://github.com/apache/spark/commit/27d97e56295465bb717da9ff08d2571998adf2b1).
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303304
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
--- End diff --
Yes, it would be wrong to use the default value of 1. Take a data set of:

    Age, Count
    20, 1
    15, 1
    10, 0

If we take the default value of 1L when the frequency is 0, then the .5 percentile would become 15. This is incorrect. I agree with the other suggestion of either failing or disregarding those values.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95303089
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala ---
@@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
       }
     }
   """
-}.mkString("\n")
-comparisons
+}
+
+ctx.splitExpressions(
+  expressions = comparisons,
+  funcName = "compare",
+  arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")),
+  returnType = "int",
+  makeSplitFunction = { body =>
+    s"""
+      InternalRow ${ctx.INPUT_ROW} = null;  // Holds current row being evaluated.
+      $body
+      return 0;
+    """
+  },
+  foldFunctions = { funCalls =>
+    val comp = ctx.freshName("comp")
+    funCalls.zipWithIndex.map { case (funCall, i) =>
+      s"""
+        int ${comp}_$i = $funCall;
--- End diff --
can you double check? the implementation of `freshName` is
```
if (freshNameIds.contains(fullName)) {
  val id = freshNameIds(fullName)
  freshNameIds(fullName) = id + 1
  s"$fullName$id"
} else {
  freshNameIds += fullName -> 1
  fullName
}
```
it already adds an id postfix.
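The quoted `freshName` logic can be run in isolation to confirm cloud-fan's point that it already appends an id suffix on collisions. This is a sketch using a plain mutable map in place of `CodegenContext`'s internal state:

```scala
import scala.collection.mutable

object FreshNames {
  private val freshNameIds = mutable.Map.empty[String, Int]

  // Mirrors the quoted implementation: the first use of a name returns it
  // bare; later uses return name1, name2, ... so generated names never collide.
  def freshName(fullName: String): String =
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
}
```

So the extra `_$i` suffix in the diff is defensive rather than strictly required for uniqueness within one codegen context.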
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303054
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
+    val frqLong = frqValue.asInstanceOf[Number].longValue()
+    // add only when frequency is positive
+    if (frqLong > 0) {
--- End diff --
I think the option was between either failing or disregarding those values. We can certainly make this a requirement, document it, and fail when the values are negative. For the cases where values are either null or 0, I think we should not add them to the map, to avoid unnecessarily bloating it. The logic would look like:

    if (frqLong < 0) {
      throw new SomeException
    } else if (frqLong > 0) {
      // process to add them to map
    }

Let me know if the above looks good and I will make the changes accordingly.
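The handling proposed in this exchange — fail on negative frequencies, skip zero (and null) ones so they never bloat the counts map — can be sketched as follows. `FreqSketch.update` is a hypothetical simplification of `Percentile.update`, with a plain mutable map standing in for Spark's `OpenHashMap` and `IllegalArgumentException` for whatever exception the final patch throws:

```scala
import scala.collection.mutable

object FreqSketch {
  // Add a (value, frequency) pair to the counts map: negative frequencies are
  // an error, zero frequencies contribute nothing and are not stored at all.
  def update(buffer: mutable.Map[Double, Long], key: Double, frq: Long): Unit =
    if (frq < 0) {
      throw new IllegalArgumentException(s"negative frequency: $frq")
    } else if (frq > 0) {
      buffer.put(key, buffer.getOrElse(key, 0L) + frq)
    }
}
```

On the thread's Age/Count example (20→1, 15→1, 10→0), the zero-frequency row for age 10 is simply absent from the map, so it cannot drag the median down.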
[GitHub] spark pull request #15413: [SPARK-17847][ML] Reduce shuffled data size of Ga...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15413
[GitHub] spark issue #15413: [SPARK-17847][ML] Reduce shuffled data size of GaussianM...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/15413 Merged into master. Thanks for all reviews.
[GitHub] spark pull request #15413: [SPARK-17847][ML] Reduce shuffled data size of Ga...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15413#discussion_r95302720
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala ---
@@ -126,9 +143,93 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
     testEstimatorAndModelReadWrite(gm, dataset, GaussianMixtureSuite.allParamSettings, checkModelData)
   }
+
+  test("univariate dense/sparse data with two clusters") {
+    val weights = Array(2.0 / 3.0, 1.0 / 3.0)
+    val means = Array(Vectors.dense(5.1604), Vectors.dense(-4.3673))
+    val covs = Array(Matrices.dense(1, 1, Array(0.86644)), Matrices.dense(1, 1, Array(1.1098)))
+    val gaussians = means.zip(covs).map { case (mean, cov) =>
+      new MultivariateGaussian(mean, cov)
+    }
+    val expected = new GaussianMixtureModel("dummy", weights, gaussians)
+
+    Seq(denseDataset, sparseDataset).foreach { dataset =>
+      val actual = new GaussianMixture().setK(2).setSeed(seed).fit(dataset)
+      modelEquals(expected, actual)
+    }
+  }
+
+  test("check distributed decomposition") {
--- End diff --
Yeah, I also suffer from bad initialization in some of my use cases. So I think we should push to commit [SPARK-15785](https://issues.apache.org/jira/browse/SPARK-15785) first. It's easier to add a correctness test after we support an initial model. I'll leave this as a follow-up and open [SPARK-19144](https://issues.apache.org/jira/browse/SPARK-19144) to track it. Thanks.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16233 Sorry, I have not finished the review. Need to buy milk for my baby. Will continue the review soon. : )
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95302345
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -50,6 +50,36 @@ object SimpleAnalyzer extends Analyzer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))

/**
+ * Provides a way to keep state during the analysis, this enables us to decouple the concerns
+ * of analysis environment from the catalog.
+ *
+ * Note this is thread local.
+ *
+ * @param defaultDatabase The default database used in the view resolution, this overrules the
+ *                        current catalog database.
+ * @param nestedViewLevel The nested level in the view resolution, this enables us to limit the
+ *                        depth of nested views.
+ */
+case class AnalysisContext(
+    defaultDatabase: Option[String] = None,
+    nestedViewLevel: Int = 0)
+
+object AnalysisContext {
+  private val value = new ThreadLocal[AnalysisContext]() {
+    override def initialValue: AnalysisContext = AnalysisContext()
+  }
+
+  def get: AnalysisContext = value.get()
+  def set(context: AnalysisContext): Unit = value.set(context)
--- End diff --
This function is private, right?
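`set` indeed only needs to be visible to a scoping helper that installs a context, runs a block, and restores the previous context, which is why making it private is safe. A minimal sketch of that pattern follows; `Ctx` and `withContext` are stand-ins for the real `AnalysisContext` and its `withAnalysisContext` helper, which carry the same two fields:

```scala
case class Ctx(defaultDatabase: Option[String] = None, nestedViewLevel: Int = 0)

object Ctx {
  // One context per thread; analysis of a plan stays on a single thread.
  private val value = new ThreadLocal[Ctx]() {
    override def initialValue: Ctx = Ctx()
  }

  def get: Ctx = value.get()
  private def set(context: Ctx): Unit = value.set(context)

  // Install `context` for the duration of `body`, then restore the previous
  // context; this is the only caller of `set`, so `set` can be private.
  def withContext[A](context: Ctx)(body: => A): A = {
    val originContext = get
    set(context)
    try body finally set(originContext)
  }
}
```

The try/finally restore is what makes nested view resolution work: each level sees its own default database, and the outer level's context reappears when the block exits (even on error).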
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95302084
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
--- End diff --
We already introduce `AnalysisContext`. It does not need this param, right?
```Scala
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
  case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
    val defaultDatabase = AnalysisContext.get.defaultDatabase
    val relation = lookupTableFromCatalog(u, defaultDatabase)
    resolveRelation(relation)
  // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
  // `viewText` should be defined, or else we throw an error on the generation of the View
  // operator.
  case view @ View(desc, _, child) if !child.resolved =>
    val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
    val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
      nestedViewLevel = nestedViewLevel)
    // Resolve all the UnresolvedRelations and Views in the child.
    val newChild = AnalysisContext.withAnalysisContext(context) {
      execute(child)
    }
    view.copy(child = newChild)
  case p @ SubqueryAlias(_, view: View, _) =>
    val newChild = resolveRelation(view)
    p.copy(child = newChild)
  case _ => plan
}
```
Please also update the function comment.
[GitHub] spark issue #16465: [SPARK-19064][PySpark]Fix pip installing of sub componen...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16465 Gentle ping for @joshrosen, or @davies maybe?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Merged build finished. Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71110/testReport)** for PR 16487 at commit [`6209d04`](https://github.com/apache/spark/commit/6209d04709aee523b8fd43652fbcc87c8908e168). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71110/ Test FAILed.
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14725 I don't think the wrappers are public APIs per se, but I agree reducing the amount of boilerplate scala code required to expose the ML stuff is good if we can make it robust :)
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300754 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/OrderingSuite.scala --- @@ -127,4 +127,17 @@ class OrderingSuite extends SparkFunSuite with ExpressionEvalHelper { } } } + + test("SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB") { +val sortOrder = Literal("abc").asc + +// this is passing prior to SPARK-16845, and it should also be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(40)(sortOrder)) + +// this is FAILING prior to SPARK-16845, but it should be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(450)(sortOrder)) --- End diff -- Sure, I'll remove the `450` test case
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300735 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- It's `comp_0`, `comp_1` in the following: ```scala /* 15388 */ public int compare(InternalRow a, InternalRow b) { /* 15389 */ /* 15390 */ int comp_0 = compare_0(a, b); /* 15391 */ if (comp_0 != 0) { /* 15392 */ return comp_0; /* 15393 */ } /* 15394 */ /* 15395 */ int comp_1 = compare_1(a, b); /* 15396 */ if (comp_1 != 0) { /* 15397 */ return comp_1; /* 15398 */ } /* 1 */ /* 1 */ ... /* 1 */ /* 15450 */ return 0; /* 15451 */ } ``` so maybe we should keep this `_$i`?
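The folding shown in the generated `compare_0`/`compare_1` snippet above — call each split function in turn and short-circuit on the first nonzero result — can be sketched outside codegen. This is an illustrative Python sketch of the idea, not Spark's generated Java:

```python
def fold_comparators(comparators):
    """Combine per-chunk comparison functions into one ordering: call each in
    turn and return the first nonzero result, mirroring how the generated
    compare_0, compare_1, ... calls are folded in the snippet above."""
    def compare(a, b):
        for comp in comparators:
            r = comp(a, b)
            if r != 0:
                return r
        return 0
    return compare

def column_comparator(i):
    # Three-way comparison on column i: returns -1, 0, or 1.
    return lambda a, b: (a[i] > b[i]) - (a[i] < b[i])

# A wide sort order is split into chunks so that no single generated method
# exceeds the JVM's 64 KB bytecode limit (the bug SPARK-16845 fixes); here
# each "chunk" is just one column comparator for simplicity.
ordering = fold_comparators([column_comparator(i) for i in range(3)])
```

Splitting into named functions and folding their results keeps each method small while preserving the left-to-right, short-circuiting semantics of a multi-column ordering.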
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71109/ Test PASSed.
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16514 Merged build finished. Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95300580 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -50,6 +50,36 @@ object SimpleAnalyzer extends Analyzer( new SimpleCatalystConf(caseSensitiveAnalysis = true)) /** + * Provides a way to keep state during the analysis, this enables us to decouple the concerns + * of analysis environment from the catalog. + * + * Note this is thread local. + * + * @param defaultDatabase The default database used in the view resolution, this overrules the + *current catalog database. + * @param nestedViewLevel The nested level in the view resolution, this enables us to limit the + *depth of nested views. --- End diff -- This is not being used, right? If so, could you update the comment?
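The thread-local analysis-context pattern discussed above — push the view's own default database while resolving its children, then restore the enclosing context — can be sketched as follows. Python is used for brevity; the class and method names echo the review discussion but are illustrative assumptions, not Spark's actual API:

```python
import threading

class AnalysisContext:
    """Per-thread stack of analysis state: the default database used for
    view resolution and the current view-nesting depth."""
    _local = threading.local()

    def __init__(self, default_database=None, nested_view_level=0):
        self.default_database = default_database
        self.nested_view_level = nested_view_level

    @classmethod
    def get(cls):
        # Lazily create a per-thread stack with a root (empty) context.
        if not getattr(cls._local, "stack", None):
            cls._local.stack = [cls()]
        return cls._local.stack[-1]

    @classmethod
    def with_context(cls, context, body):
        # Push the view's context, run the resolution body, and restore the
        # enclosing context even if resolution raises.
        cls.get()  # ensure the stack exists
        cls._local.stack.append(context)
        try:
            return body()
        finally:
            cls._local.stack.pop()

# Resolving a nested view: the child plan is analyzed under the view's own
# default database; afterwards the outer context is back in effect.
seen = AnalysisContext.with_context(
    AnalysisContext(default_database="db2", nested_view_level=1),
    lambda: AnalysisContext.get().default_database)
```

Because the state is thread-local and stack-shaped, concurrent analyses do not interfere, and nested views each see exactly the default database of the view that references them.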
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16514 **[Test build #71109 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71109/testReport)** for PR 16514 at commit [`0805cf6`](https://github.com/apache/spark/commit/0805cf6300953ac813ca9d10b76948a42af02a95). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16255#discussion_r95300466 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] { case plan: Project if plan eq proj => plan.child case plan => plan transformExpressions { case a: Attribute if attrMap.contains(a) => attrMap(a) + case b: Alias if attrMap.exists(_._1.exprId == b.exprId) +&& b.child.isInstanceOf[NamedExpression] => b.child --- End diff -- I don't get it, for `alias a#1 to a#2`, we wanna `replace all a#2 with a#1`, so we will do nothing for `alias a#1 to a#2`, because we can't find an attribute `a#2`
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300097 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/OrderingSuite.scala --- @@ -127,4 +127,17 @@ class OrderingSuite extends SparkFunSuite with ExpressionEvalHelper { } } } + + test("SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB") { +val sortOrder = Literal("abc").asc + +// this is passing prior to SPARK-16845, and it should also be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(40)(sortOrder)) + +// this is FAILING prior to SPARK-16845, but it should be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(450)(sortOrder)) --- End diff -- This is unnecessary, it's covered by the `5000` test case.
[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15480 LGTM
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r9536 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- nit: `ctx.freshName` already adds postfix to the name, you don't need to add `_$i` again.
[GitHub] spark pull request #16481: [SPARK-19092] [SQL] Save() API of DataFrameWriter...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16481#discussion_r95299609 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -413,17 +413,22 @@ case class DataSource( relation } - /** Writes the given [[DataFrame]] out to this [[DataSource]]. */ + /** + * Writes the given [[DataFrame]] out to this [[DataSource]]. + * + * @param isForWriteOnly Whether to just write the data without returning a [[BaseRelation]]. + */ def write( mode: SaveMode, - data: DataFrame): BaseRelation = { + data: DataFrame, + isForWriteOnly: Boolean = false): Option[BaseRelation] = { if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) { throw new AnalysisException("Cannot save interval data type into external storage.") } providingClass.newInstance() match { case dataSource: CreatableRelationProvider => -dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data) +Some(dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)) --- End diff -- it would be really weird if `CreatableRelationProvider.createRelation` can return a relation with different schema from the written `data`. Is it safe to assume the schema won't change? cc @marmbrus @yhuai @liancheng
[GitHub] spark issue #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cache After...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16500 I'm wondering if we need the metadata cache anymore. Now we store partitions in the metastore, and have a cache for leaf files, what's the benefit of metadata cache?
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95298994 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { // Only do the check if the table is a data source table (the relation is a BaseRelation). + // TODO(cloud-fan): also check hive table relation here when we support overwrite mode + // for creating hive tables. --- End diff -- We should not block it. This generates `InsertIntoTable`, and it supports hive table. What we should block is `saveAsTable` with `Overwrite` mode, which generates `CreateTable`. `insert overwrite` is different from `create table with overwrite mode`
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16517 **[Test build #7 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)** for PR 16517 at commit [`36c9269`](https://github.com/apache/spark/commit/36c9269c45b4bdca3be36a4e73bf48f8e273a9a3).
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r95298791 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala --- @@ -1,356 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.hive - -import java.text.NumberFormat -import java.util.{Date, Locale} - -import scala.collection.JavaConverters._ - -import org.apache.hadoop.fs.Path -import org.apache.hadoop.hive.common.FileUtils -import org.apache.hadoop.hive.conf.HiveConf.ConfVars -import org.apache.hadoop.hive.ql.exec.{FileSinkOperator, Utilities} -import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} -import org.apache.hadoop.hive.ql.plan.TableDesc -import org.apache.hadoop.hive.serde2.Serializer -import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} -import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption -import org.apache.hadoop.io.Writable -import org.apache.hadoop.mapred._ -import org.apache.hadoop.mapreduce.TaskType - -import org.apache.spark._ -import org.apache.spark.internal.Logging -import org.apache.spark.internal.io.SparkHadoopWriterUtils -import org.apache.spark.mapred.SparkHadoopMapRedUtil -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.execution.UnsafeKVExternalSorter -import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} -import org.apache.spark.sql.types._ -import org.apache.spark.util.SerializableJobConf -import org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter - -/** - * Internal helper class that saves an RDD using a Hive OutputFormat. - * It is based on `SparkHadoopWriter`. 
- */ -private[hive] class SparkHiveWriterContainer( -@transient private val jobConf: JobConf, -fileSinkConf: FileSinkDesc, -inputSchema: Seq[Attribute]) - extends Logging - with HiveInspectors - with Serializable { - - private val now = new Date() - private val tableDesc: TableDesc = fileSinkConf.getTableInfo - // Add table properties from storage handler to jobConf, so any custom storage - // handler settings can be set to jobConf - if (tableDesc != null) { -HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, jobConf, false) -Utilities.copyTableJobPropertiesToConf(tableDesc, jobConf) - } - protected val conf = new SerializableJobConf(jobConf) - - private var jobID = 0 - private var splitID = 0 - private var attemptID = 0 - private var jID: SerializableWritable[JobID] = null - private var taID: SerializableWritable[TaskAttemptID] = null - - @transient private var writer: FileSinkOperator.RecordWriter = null - @transient protected lazy val committer = conf.value.getOutputCommitter - @transient protected lazy val jobContext = new JobContextImpl(conf.value, jID.value) - @transient private lazy val taskContext = new TaskAttemptContextImpl(conf.value, taID.value) - @transient private lazy val outputFormat = -conf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] - - def driverSideSetup() { -setIDs(0, 0, 0) -setConfParams() -committer.setupJob(jobContext) - } - - def executorSideSetup(jobId: Int, splitId: Int, attemptId: Int) { -setIDs(jobId, splitId, attemptId) -setConfParams() -committer.setupTask(taskContext) -initWriters() - } - - protected def getOutputName: String = { -val numberFormat = NumberFormat.getInstance(Locale.US) -numberFormat.setMinimumIntegerDigits(5) -numberFormat.setGroupingUsed(false) -val extension = Utilities.getFileExtension(conf.value, fileSinkConf.getCompressed, outputFormat)
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/14725 Sure, I can add a better docstring. This is just for developers and doesn't have to be used, but it can be used to avoid creating more Java-friendly functions only because they have arrays - which happens a lot in mllib. Anything to help limit public APIs is very useful, imho.
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95298747 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { // Only do the check if the table is a data source table (the relation is a BaseRelation). + // TODO(cloud-fan): also check hive table relation here when we support overwrite mode + // for creating hive tables. --- End diff -- Although we ignore the specified provider, we still respect the actual format of the table. For example, below is the Hive table. We are not blocking it. Should we block it to make them consistent? ```Scala sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value") val df = sql(s"SELECT key, value FROM $tableName") df.write.mode("overwrite").insertInto(tableName) ```
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95298151 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala --- @@ -52,7 +55,36 @@ private[spark] class TaskDescription( val addedFiles: Map[String, Long], val addedJars: Map[String, Long], val properties: Properties, -val serializedTask: ByteBuffer) { +private var serializedTask_ : ByteBuffer, +private var task_ : Task[_] = null) extends Logging { --- End diff -- @squito I agree with you, I will modify it. @kayousterhout The main reason for this is to reduce the amount of code changes, we need to change more code files to implement what you say.
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/15211 Sent an update to include a R unit test. Yet I met a problem that there's a constant scaling difference between LinearSVC and R e1071 (which essentially is LibSVM). It's possible that it's caused by some parameter setting. Post it anyway to see if there's any suggestions. Sorry @zhengruifeng, I'll address your comment in the next update.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297133 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s). Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, percentageExpression: Expression, +withFrqExpr : Boolean, --- End diff -- Please remove withFrqExpr. The frequency must be provided, and should default to 1L.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297232 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- Why do we need to make a difference here? Just do: ```scala override def children: Seq[Expression] = { child :: frequency :: percentageExpression :: Nil } ```
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297644
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -91,9 +110,16 @@ case class Percentile(
     case _ => DoubleType
   }
-  override def inputTypes: Seq[AbstractDataType] = percentageExpression.dataType match {
-    case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-    case _ => Seq(NumericType, DoubleType)
+  override def inputTypes: Seq[AbstractDataType] = {
+    val percentageExpType = percentageExpression.dataType match {
+      case _: ArrayType => ArrayType(DoubleType)
+      case _ => DoubleType
+    }
+    if (withFrqExpr) {
--- End diff --
Again, remove `withFrqExpr`.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297633
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
+    val frqLong = frqValue.asInstanceOf[Number].longValue()
+    // add only when frequency is positive
+    if (frqLong > 0) {
--- End diff --
Let's make this a requirement and fail when the value is < 0.
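Under this proposal, the update path would reject negative frequencies outright rather than silently dropping those rows. A self-contained sketch of that behaviour, using a plain mutable map in place of Spark's `OpenHashMap` (the names and signatures here are illustrative, not Spark's actual code):

```scala
import scala.collection.mutable

object UpdateSketch extends App {
  // Accumulate value -> total frequency, failing fast on negative frequencies.
  // Option stands in for nullable column values in this sketch.
  def update(buffer: mutable.Map[Double, Long],
             key: Option[Double],
             frq: Option[Long]): Unit = {
    for (k <- key; f <- frq) {
      require(f >= 0, s"Negative frequency $f found for key $k")
      if (f > 0) buffer(k) = buffer.getOrElse(k, 0L) + f
    }
  }

  val buffer = mutable.Map.empty[Double, Long]
  update(buffer, Some(1.0), Some(3L))
  update(buffer, Some(1.0), Some(2L))
  update(buffer, Some(2.0), None)      // null frequency: row is skipped
  update(buffer, Some(3.0), Some(0L))  // zero frequency: contributes nothing
  println(buffer)
}
```

The difference from the diff under review is only the `require`: a frequency of zero still contributes nothing, but a negative frequency now raises an error instead of being quietly ignored.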
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297609
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
--- End diff --
I don't think we should default to 1L as a value. That seems more wrong than either failing or skipping.
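For context on what the accumulated counts feed into: a frequency-weighted exact percentile can be computed from the (value -> count) map by linear interpolation between neighbouring ranks. The following is a simplified sketch of that idea, not Spark's actual implementation:

```scala
object PercentileCalc extends App {
  // Exact percentile over a value -> frequency map. The target rank is
  // p * (total - 1); when it is fractional, interpolate linearly between
  // the values at the two neighbouring integer ranks.
  def percentile(counts: Map[Double, Long], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, s"percentage $p must be between 0.0 and 1.0")
    val sorted = counts.toSeq.sortBy(_._1)
    val total = sorted.map(_._2).sum
    val pos = p * (total - 1)

    // Value at 0-based rank n, walking the cumulative frequencies in order.
    def valueAt(n: Long): Double = {
      var cum = 0L
      sorted.find { case (_, c) => cum += c; cum > n }.get._1
    }

    val (lo, hi) = (math.floor(pos).toLong, math.ceil(pos).toLong)
    val (lv, hv) = (valueAt(lo), valueAt(hi))
    lv + (hv - lv) * (pos - lo)
  }

  // Median of 1,2,3,4 (each with frequency 1) interpolates between 2 and 3.
  println(percentile(Map(1.0 -> 1L, 2.0 -> 1L, 3.0 -> 1L, 4.0 -> 1L), 0.5))
}
```

This also shows why the null/default debate matters: a frequency defaulted to 1L would shift the cumulative counts and hence the result, whereas skipping or failing leaves the semantics unambiguous.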
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487
**[Test build #71110 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71110/testReport)** for PR 16487 at commit [`6209d04`](https://github.com/apache/spark/commit/6209d04709aee523b8fd43652fbcc87c8908e168).
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95295321
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
--- End diff --
We should check the value of the `frequencyExpression` beforehand.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95294516
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
       _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value
      array of numeric column `col` at the given percentage(s). Each value of the percentage
      array must be between 0.0 and 1.0.
-      """)
+
+      _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric
+      column `col` with frequency column `frequency` at the given percentage. The value of
+      percentage must be between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile
+      value array of numeric column `col` with frequency column `frequency` at the given
+      percentage(s). Each value of the percentage array must be between 0.0 and 1.0.
+
+  """)
 case class Percentile(
     child: Expression,
+    frequency : Expression,
--- End diff --
Rename this to `frequencyExpression`, and place it after `percentageExpression`.