[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95313666 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala --- @@ -219,6 +219,9 @@ class SQLBuilder private ( case OneRowRelation => "" +case p: View => --- End diff -- Maybe add a TODO comment above this line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user actuaryzhang commented on the issue: https://github.com/apache/spark/pull/16344 Jenkins, test this please
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16344 Can one of the admins verify this patch?
[GitHub] spark pull request #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
GitHub user actuaryzhang reopened a pull request: https://github.com/apache/spark/pull/16344 [SPARK-18929][ML] Add Tweedie distribution in GLM ## What changes were proposed in this pull request? I propose to add the full Tweedie family into the GeneralizedLinearRegression model. The Tweedie family is characterized by a power variance function. Currently supported distributions such as the Gaussian, Poisson and Gamma families are special cases of the Tweedie (https://en.wikipedia.org/wiki/Tweedie_distribution). @yanboliang @srowen @sethah I propose to add support for the other distributions: - compound Poisson: 1 < varPower < 2. This one is widely used to model zero-inflated continuous distributions, e.g., in insurance, finance, ecology, meteorology, advertising etc. - positive stable: varPower > 2 and varPower != 3. Used to model extreme values. - inverse Gaussian: varPower = 3. The Tweedie family is supported in most statistical packages such as R (statmod), SAS, h2o etc. Changes made: - Allow `tweedie` in family. Only `identity` and `log` links are allowed for now. - Add `varPower` to `GeneralizedLinearRegressionBase`, which takes values in (1, 2) and (2, infty). Also set the default value to 1.5 and add a getter method. - Add `Tweedie` class - Add tests for tweedie GLM Note: - In computing deviance, use `math.max(y, 0.1)` to avoid taking the inverse of 0. This is the same as in R: `tweedie()$dev.res` - `aic` is not supported in this PR because the evaluation of the [Tweedie density](http://www.statsci.org/smyth/pubs/tweediepdf-series-preprint.pdf) in these cases is non-trivial. I will implement the density approximation method in a future PR. R returns `null` (see `tweedie()$aic`). 
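For reference, the unit deviance behind the `math.max(y, 0.1)` note above can be sketched in a few lines of Python. This is a hedged illustration of the standard Tweedie deviance formula for a variance power p not equal to 1 or 2, not the PR's actual Scala code:

```python
def tweedie_unit_deviance(y, mu, p):
    """Unit deviance of the Tweedie family for variance power p != 1, 2.

    y is clamped below at 0.1, mirroring the PR's note about avoiding
    the inverse of zero (the same guard as R's tweedie()$dev.res).
    """
    y = max(y, 0.1)  # guard: negative powers of y blow up at y == 0
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )
```

At y == mu the deviance is zero, and it grows as the fitted mean moves away from the observation, which is what makes it usable as a goodness-of-fit measure across the whole family.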
You can merge this pull request into a Git repository by running: $ git pull https://github.com/actuaryzhang/spark tweedie Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16344 commit 952887e485fb0d5fa669b3b4c9289b8069ee7769 Author: actuaryzhang Date: 2016-12-16T00:50:51Z Add Tweedie family to GLM commit 4f184ec458f5ed7d70bc5b8165481425f911d2a3 Author: actuaryzhang Date: 2016-12-19T22:50:02Z Fix calculation in dev resid; Add test for different var power commit 7fe39106332663d3671b94a8ffac48ca61c48470 Author: actuaryzhang Date: 2016-12-19T23:14:37Z Merge test into GLR commit bfcc4fb08d54156efc66b90d14c62ea7ff172afa Author: actuaryzhang Date: 2016-12-20T22:59:05Z Use Tweedie class instead of global object Tweedie; change variancePower to varPower commit a8feea7d8095170c1b5f18b7887f16a6d763e42c Author: actuaryzhang Date: 2016-12-21T23:42:40Z Allow Family to use GLRBase object directly commit 233e2d338be8d36a74eaf578bfea804ae3617d4e Author: actuaryzhang Date: 2016-12-22T01:56:34Z Add TweedieFamily and implement specific distn within Tweedie commit 17c55816c914bc96a8b5141356e3c117f343f303 Author: actuaryzhang Date: 2016-12-22T04:39:54Z Clean up doc commit 0b41825e99020976a34d8fe9c983f26de6c8c40f Author: actuaryzhang Date: 2016-12-22T17:52:01Z Move defaultLink and name to subclass of TweedieFamily commit 6e8e60771afb4abe43e47c7fe186bad1541a8fac Author: actuaryzhang Date: 2016-12-22T18:10:51Z Change style for AIC commit 8d7d34e258f9c7c03c80754d837ce847fcb0526e Author: actuaryzhang Date: 2016-12-23T19:10:20Z Rename Family methods and restore methods for tweedie subclasses commit 6da7e3068e2c45a0faf7ff35c10b2750784d765e Author: actuaryzhang Date: 2016-12-23T19:12:25Z Update test commit 9a71e89f629260c775922901a04c989f36ea4946 Author: actuaryzhang Date: 
2016-12-27T17:16:40Z Clean up doc commit f461c09e65360f695ad3092b41bc26e0c61bbd95 Author: actuaryzhang Date: 2016-12-27T22:18:39Z Put delta in Tweedie companion object commit a839c4631dd17c4f3d0a0cc99e1b0af81419dda4 Author: actuaryzhang Date: 2016-12-27T22:23:57Z Clean up doc commit fab265278109eede4cce7ee506e8b29d481c4549 Author: actuaryzhang Date: 2017-01-05T19:32:06Z Allow more link functions in tweedie
[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15480 **[Test build #71115 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71115/testReport)** for PR 15480 at commit [`4aef473`](https://github.com/apache/spark/commit/4aef473b70ef20d7a93fd974a3d844e49ca6ef9e).
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Merged build finished. Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71112/ Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71112 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71112/testReport)** for PR 16487 at commit [`27d97e5`](https://github.com/apache/spark/commit/27d97e56295465bb717da9ff08d2571998adf2b1). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71114 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71114/testReport)** for PR 12064 at commit [`083e4f9`](https://github.com/apache/spark/commit/083e4f932ee9bd65d211bcb8506d53bb35c160f6).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95310611 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply --- End diff -- : ) I did not see your comment.
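The attribute-matching step performed by the `AliasViewChild` rule quoted above can be sketched outside Spark. Below is a hypothetical Python analogue of `findAttributeByName`, with a simple case-insensitive function standing in for `conf.resolver`:

```python
def find_attribute_by_name(name, attrs, resolver=None):
    """Return the attribute in `attrs` whose name matches `name` under
    `resolver`; raise if none matches (like the rule's AnalysisException).

    `attrs` is modeled here as a list of attribute-name strings; the
    default resolver is a case-insensitive stand-in for conf.resolver.
    """
    if resolver is None:
        resolver = lambda a, b: a.lower() == b.lower()
    for attr in attrs:
        if resolver(attr, name):
            return attr
    raise ValueError(
        "Attribute with name '%s' is not found in (%s)" % (name, ",".join(attrs))
    )
```

The rule then wraps each matched attribute in an alias carrying the view output's expression id, so downstream operators keep referring to the view's declared columns.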
[GitHub] spark issue #16476: [SPARK-19084][SQL] Implement expression field
Github user gczsjdy commented on the issue: https://github.com/apache/spark/pull/16476 @chenghao-intel I think that the optimize rule will fold the parameters that have different types with param0, and then disorganize the parameters' index. Thanks. @rxin I have removed the [WIP] mark, this PR is ready now, could you please help me review it? Thanks.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95310486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Ah i see - let me update this :-)
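Conceptually, the `splitExpressions`/`foldFunctions` change above moves from one giant generated comparison chain (which can exceed the JVM's 64KB method limit) to many small compare functions whose results are folded with early returns. A rough Python model of that fold, with hypothetical names, not the generated Java:

```python
def fold_comparisons(a, b, compare_fns):
    """Apply each split compare function in order; the first non-zero
    result decides the ordering, mirroring the generated early returns."""
    for compare in compare_fns:
        result = compare(a, b)
        if result != 0:
            return result
    return 0  # all sort keys compared equal

# One small compare function per sort key, as the codegen splits them:
by_first = lambda a, b: (a[0] > b[0]) - (a[0] < b[0])
by_second = lambda a, b: (a[1] > b[1]) - (a[1] < b[1])
```

Splitting this way keeps each generated method small while preserving the short-circuit semantics of the original single chain.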
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95310499 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply --- End diff -- Can we add an assert below to ensure the outputs are the same?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16487 LGTM
[GitHub] spark issue #16510: [SPARK-19130][SPARKR] Support setting literal value as c...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16510 This is pretty useful!
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309414 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- The rule `ImplicitTypeCasts` is done in the batch of `Resolution`, right?
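The `checkDataType` helper under discussion rejects view children whose column types cannot be cast to the view's declared output types. A hypothetical, much-simplified Python version follows; the cast table here is purely illustrative and does not reproduce Spark's actual `Cast.canCast` rules:

```python
# Illustrative up-cast pairs; Spark's real rules live in Cast.canCast.
CASTABLE = {
    ("int", "long"), ("int", "double"), ("long", "double"),
    ("float", "double"), ("int", "string"), ("double", "string"),
}

def check_data_type(name, child_type, view_type):
    """Raise when the child attribute's type can't be cast to the view
    output's declared type, like the rule's AnalysisException."""
    if child_type != view_type and (child_type, view_type) not in CASTABLE:
        raise ValueError(
            "The dataType of attribute '%s' is '%s', which can't be "
            "casted to '%s'." % (name, child_type, view_type)
        )
```

Performing this check eagerly at analysis time surfaces an incompatible schema change in an underlying table as a clear error, rather than a confusing failure deep in execution.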
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16517 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/7/ Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309984 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { + throw new AnalysisException( +s"The dataType of attribute '$other' is '${other.dataType}', which can't be casted to " + + s"that of '$attr', expected '${attr.dataType}'.") +} + } +} + +/** + * Removes [[View]] operators from the plan. The operator is respected till the end of analysis + * stage because we want to see which part of a analyzed logical plan is generated from a view. 
--- End diff -- Nit: `a analyzed` -> `an analyzed`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above]
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { --- End diff -- Nit: `transformUp` -> `transform`. When the order does not matter, we use `transform`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309603 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above]
+ */ +object EliminateView extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +// The child should have the same output attributes with the View operator, so we simply +// remove the View operator. +case View(_, output, child) => child --- End diff -- Nit: `case View(_, output, child) => child` -> `case View(_, _, child) => child`
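As a side note, the shape of such a wrapper-elimination rule can be sketched outside Spark with a toy plan type (the names below are hypothetical and simplified; real code works on Spark's `TreeNode`/`LogicalPlan` hierarchy):

```scala
// Minimal sketch of a wrapper-elimination rule over a toy plan tree.
// `View` carries name/output metadata only; removing it leaves the child,
// which by this point already produces the same output attributes.
sealed trait Plan
case class Scan(table: String) extends Plan
case class View(name: String, child: Plan) extends Plan

def eliminateView(plan: Plan): Plan = plan match {
  case View(_, child) => eliminateView(child) // recurse: views may be nested
  case other          => other
}
```

Under these assumptions, `eliminateView(View("v1", View("v2", Scan("t"))))` reduces to `Scan("t")`.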
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95309481 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ [diff context elided; duplicate of the view.scala diff quoted above] + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- Have you added any test case for type casting?
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16517 **[Test build #7 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)** for PR 16517 at commit [`36c9269`](https://github.com/apache/spark/commit/36c9269c45b4bdca3be36a4e73bf48f8e273a9a3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16517 Merged build finished. Test PASSed.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95309134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- oh sorry, we should put the `freshName` in the map function ``` funCalls.zipWithIndex.map { case (funCall, i) => val comp = ctx.freshName("comp") s""" int $comp = $funCall; ... } ```
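The point of the fix can be illustrated with a self-contained sketch (names here are hypothetical; `Ctx.freshName` stands in for Spark's `CodegenContext.freshName`): generating the name inside the map closure yields a distinct identifier per split-function call, so the concatenated snippets never declare the same variable twice.

```scala
// Stand-in for CodegenContext.freshName: appends a unique numeric suffix.
object Ctx {
  private var counter = -1
  def freshName(prefix: String): String = { counter += 1; s"${prefix}_$counter" }
}

// One snippet per generated split-function call, as in foldFunctions.
def foldFunctions(funCalls: Seq[String]): String =
  funCalls.map { funCall =>
    val comp = Ctx.freshName("comp") // fresh per call, inside the closure
    s"int $comp = $funCall; if ($comp != 0) return $comp;"
  }.mkString("\n")

val folded = foldFunctions(Seq("compare_0(a, b)", "compare_1(a, b)"))
// folded declares comp_0 and comp_1 — not one name declared twice.
```

Hoisting `freshName` out of the loop (as in the original diff) would reuse one `comp` name in every snippet, producing exactly the duplicate declarations lw-lin's generated-code excerpt below shows.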
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95308910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above] + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. --- End diff -- `stable` -> `completely resolved`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95308787 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ [diff context elided; duplicate of the view.scala diff quoted above] + * name. This should be only done after the resolution batch, because the view attributes are --- End diff -- Nit: `after the resolution batch` -> `after the batch of Resolution`
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95308691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- Let's make the param sequence to: `child, percentageExpression, frequencyExpression`, and give a default value to `frequencyExpression`.
[GitHub] spark issue #16522: [SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalC...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16522 Hi, @vanzin and @zsxwing. The PR passes the tests. Could you review this PR again?
[GitHub] spark issue #16440: [SPARK-18857][SQL] Don't use `Iterator.duplicate` for `i...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16440 Hi, @srowen. Could you merge this PR?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16487 LGTM pending test
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95307494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. 
Else use the currentDb of the SessionCatalog. --- End diff -- ``` 1. u.tableIdentifier.database, if defined 2. defaultDatabase, if defined 3. currentDb of SessionCatalog, otherwise. ```
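The precedence gatorsmile spells out is essentially an `orElse` chain. A hedged sketch (the helper and parameter names below are hypothetical, not Spark's actual API):

```scala
// The database used to resolve a relation: the table identifier's own
// database if defined, else the view's default database if defined, else
// the session's current database.
def resolveDatabase(
    identifierDb: Option[String],
    viewDefaultDb: Option[String],
    currentDb: String): String =
  identifierDb.orElse(viewDefaultDb).getOrElse(currentDb)
```

For example, resolving a bare table name inside a view created under `db1` while the session is in `db2` would use `db1`, matching step 2 of the lookup rules.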
[GitHub] spark issue #16523: [SPARK-19142][SparkR]:spark.kmeans should take seed, ini...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/16523 cc @yanboliang
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307385 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s).Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, --- End diff -- I think `frequencyExpression` would be the correct name, but I have sequenced the parameters as they would appear in the SQL: `select percentile(col, frq, percentage) from table`, where `frq` is optional.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307232 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s).Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, percentageExpression: Expression, +withFrqExpr : Boolean, --- End diff -- Please see my comment below why we need to make a distinction either using flag or Option I am inclined towards using a flag because switching to option would change the code in update from val frqValue = frequency.eval(input) to val frqValue = frequency.getOrElse( unit).eval(input) But i think Option[Expression] would be better logically Once we have an agreement if we need to have a distinction or not i will make the changes accordingly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
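The flag-vs-Option trade-off discussed above can be illustrated with a minimal standalone sketch (these are simplified stand-ins, not the real Catalyst `Expression` or `Percentile` classes): with `Option[Expression]`, `children` stays logically correct without a separate boolean flag.

```scala
// Simplified stand-ins for Catalyst expressions (hypothetical, for illustration only).
sealed trait Expr { def dataType: String }
case class Col(name: String, dataType: String) extends Expr
case class Lit(value: Any, dataType: String) extends Expr

// Option-based variant: the optional frequency child simply drops out of
// `children` when absent, so no `withFrqExpr` flag is needed.
case class Percentile(child: Expr, frequency: Option[Expr], percentage: Expr) {
  def children: Seq[Expr] = child +: (frequency.toSeq :+ percentage)
  def inputTypes: Seq[String] = children.map(_.dataType)
}

val withFrq = Percentile(Col("age", "int"), Some(Col("count", "int")), Lit(0.5, "double"))
val withoutFrq = Percentile(Col("age", "int"), None, Lit(0.5, "double"))
```

Both `children` and `inputTypes` shrink consistently when the frequency child is absent, which is the "logically correct" behavior argued for in the thread.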
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Thanks for clarifying on this. But we'll get something like the following (suppose we get a fresh name `comp_1`): ```java /* 15388 */ public int compare(InternalRow a, InternalRow b) { /* 15389 */ /* 15390 */ int comp_1 = compare_0(a, b); // comp_1 /* 15391 */ if (comp_1 != 0) { /* 15392 */ return comp_1; /* 15393 */ } /* 15394 */ /* 15395 */ int comp_1 = compare_1(a, b); // still comp_1 /* 15396 */ if (comp_1 != 0) { /* 15397 */ return comp_1; /* 15398 */ } /* 1 */ /* 1 */ ... /* 1 */ /* 15450 */ return 0; /* 15451 */ } ```
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95306656 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -469,9 +475,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { * @since 1.4.0 */ def table(tableName: String): DataFrame = { -Dataset.ofRows(sparkSession, - sparkSession.sessionState.catalog.lookupRelation( - sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))) +sparkSession.table(tableName) --- End diff -- + 1
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- I mean ``` val comp = ctx.freshName("comp") funCalls.zipWithIndex.map { case (funCall, i) => s""" int $comp = $funCall; ... } ``` just remove all the `_$i` here
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95306192 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -125,11 +132,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log // Otherwise, wrap the table with a Subquery using the table name. alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { + val tableIdentifier = table.identifier val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) - SubqueryAlias( -alias.getOrElse(table.identifier.table), -sparkSession.sessionState.sqlParser.parsePlan(viewText), -Option(table.identifier)) + // The relation is a view, so we wrap the relation by: + // 1. Add a [[View]] operator over the relation to keep track of the view desc; + // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. + val child = View( +desc = table, +output = table.schema.toAttributes, +child = sparkSession.sessionState.sqlParser.parsePlan(viewText)) --- End diff -- We can even add `Cast` in project list so that analyzer can report data type mismatch for us automatically.
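The suggestion above — adding `Cast` in the project list so the analyzer reports type mismatches automatically — can be illustrated with a small standalone sketch (the `Attr`, `castable`, and `castTo` names are hypothetical stand-ins, not Catalyst's real `Cast`/`canCast`):

```scala
// Hypothetical stand-in for an attribute with a data type.
case class Attr(name: String, dataType: String)

// A toy castability table playing the role of Cast.canCast.
val castable: Set[(String, String)] =
  Set(("int", "int"), ("int", "long"), ("int", "string"), ("string", "string"))

// Wrap a column in a cast to the view's declared type; an impossible cast
// fails at "analysis" time instead of producing wrong results later.
def castTo(attr: Attr, target: String): Attr =
  if (castable((attr.dataType, target))) attr.copy(dataType = target)
  else throw new IllegalArgumentException(
    s"cannot cast ${attr.name}: ${attr.dataType} to $target")

val projected = Seq(Attr("id", "int")).map(castTo(_, "long"))
```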
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95306125 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- Ah, I misunderstood it. You meant moving `ctx.freshName("comp")` into the `funCalls.zipWithIndex.map {...}`, right? Like the following: ```scala // val comp = ctx.freshName("comp") // moved this into the map {...} funCalls.zipWithIndex.map { case (funCall, i) => s""" int ${ctx.freshName("comp")} = $funCall; ... } ```
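For context on the naming discussion above, here is a minimal standalone sketch of a fresh-name generator in the spirit of `CodegenContext.freshName` (the `NameGenerator` class and the `base_id` output format are assumptions for illustration, not Spark's actual implementation): calling it once per fold step, as proposed, yields a distinct name per split-function call and avoids the duplicate `int comp_1` declarations shown in the generated code.

```scala
import scala.collection.mutable

// Hypothetical fresh-name generator: each call for the same base name
// appends an incrementing id, guaranteeing uniqueness within one context.
class NameGenerator {
  private val counters = mutable.Map.empty[String, Int].withDefaultValue(0)
  def freshName(base: String): String = {
    val id = counters(base)
    counters(base) = id + 1
    s"${base}_$id"
  }
}

val ctx = new NameGenerator
// One freshName call per fold step, as suggested in the review thread.
val names = (0 until 3).map(_ => ctx.freshName("comp"))
```

Calling `freshName` once outside the `map` and reusing the result, by contrast, would emit the same variable name in every generated block, which is exactly the collision the review comment illustrates.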
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95306084 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -125,11 +132,16 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log // Otherwise, wrap the table with a Subquery using the table name. alias.map(a => SubqueryAlias(a, qualifiedTable, None)).getOrElse(qualifiedTable) } else if (table.tableType == CatalogTableType.VIEW) { + val tableIdentifier = table.identifier val viewText = table.viewText.getOrElse(sys.error("Invalid view without text.")) - SubqueryAlias( -alias.getOrElse(table.identifier.table), -sparkSession.sessionState.sqlParser.parsePlan(viewText), -Option(table.identifier)) + // The relation is a view, so we wrap the relation by: + // 1. Add a [[View]] operator over the relation to keep track of the view desc; + // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view. + val child = View( +desc = table, +output = table.schema.toAttributes, +child = sparkSession.sessionState.sqlParser.parsePlan(viewText)) --- End diff -- Sorry, I may have asked this question before, but I can't recall your answer. Why do we need the `output` field of `View`? Can we do something like ``` val child = sparkSession.sessionState.sqlParser.parsePlan(viewText) val projectList = schema.map { field => Alias(UnresolvedAttribute(Seq(field.name)), field.name)(explicitMetadata = Some(field.metadata)) } View( desc = table, child = Project(projectList, child)) case class View(...) { def output = child.output } ```
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71113/ Test FAILed.
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12064 Merged build finished. Test FAILed.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95305820 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- I have given a lot of thought to whether we need to make a distinction here. Let's take a data set of age_string, age, count: "20", 20, 1 "15", 15, 1 "10", 10, 1 For the SQL "select percentile( age, count , 0.5 ) from table", the logically correct values should be children = age :: count :: 0.5 :: Nil and inputType = IntegerType :: IntegerType :: DoubleType :: Nil. For the SQL "select percentile( age, 0.5 ) from table", the logically correct values should be children = age :: 0.5 :: Nil and inputType = IntegerType :: DoubleType :: Nil. Here is one example where keeping it logically correct would help. For the following incorrect SQL "select percentile( age, '10') from table", with children = age :: '10' :: Nil and inputType = IntegerType :: StringType :: Nil, since both children and inputType are used for dataType validation, the error message would be correct, as below: "argument 2 requires Double type, however, 10 is of String type." However, with children = age :: Literal(1) :: '10' :: Nil and inputType = IntegerType :: IntegerType :: StringType :: Nil, the error message would NOT be correct and would be confusing, as below: "argument 3 requires Double type, however, 10 is of String type." Since both children and dataType are public methods, I was inclined to keep them explicitly correct, and therefore I decided to make a distinction. Please let me know your thoughts.
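The error-message argument above can be reproduced with a small sketch of ordinal-based input-type checking (the `checkInputTypes` helper is a simplified hypothetical stand-in for Catalyst's input-type validation, not the real API): the reported argument number comes from the position in `children`, so inserting a synthetic `Literal(1)` frequency child shifts the ordinal in the message.

```scala
// Simplified ordinal-based type check: report the first mismatching child,
// using its 1-based position in the children list (hypothetical stand-in).
def checkInputTypes(actual: Seq[String], expected: Seq[String]): Option[String] =
  actual.zip(expected).zipWithIndex.collectFirst {
    case ((a, e), i) if a != e =>
      s"argument ${i + 1} requires $e type, however, it is of $a type."
  }

// percentile(age, '10'): without a placeholder, the bad percentage is argument 2.
val noPlaceholder = checkInputTypes(Seq("Integer", "String"), Seq("Integer", "Double"))
// With an implicit Literal(1) frequency inserted, it becomes argument 3.
val withPlaceholder =
  checkInputTypes(Seq("Integer", "Integer", "String"), Seq("Integer", "Integer", "Double"))
```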
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71113/testReport)** for PR 12064 at commit [`8c2d529`](https://github.com/apache/spark/commit/8c2d5291affd75081da8152cfb6a1afc7e44a656). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305635 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: --- End diff -- Nit: `follow` -> `following`
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305370 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -174,6 +176,7 @@ case class CatalogTable( stats: Option[CatalogStatistics] = None, viewOriginalText: Option[String] = None, viewText: Option[String] = None, +viewDefaultDatabase: Option[String] = None, --- End diff -- Can we make this a table property instead of a field? Something like ``` case class CatalogTable(...) { def viewDefaultDatabase: Option[String] = properties.get(VIEW_DEFAULT_DATABASE) } object CatalogTable { val VIEW_DEFAULT_DATABASE = "view.default.database" } ``` This is only used in a few places, and I don't think it's worth adding so many hacks in `HiveExternalCatalog`.
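A standalone sketch of the property-backed accessor proposed above (a simplified `CatalogTable` carrying only a `properties` map; the real class has many more fields):

```scala
// Companion holding the property key, mirroring the suggestion above.
object CatalogTable {
  val VIEW_DEFAULT_DATABASE = "view.default.database"
}

// Simplified stand-in: the view's default database is derived from the
// properties map instead of being a dedicated constructor field.
case class CatalogTable(properties: Map[String, String] = Map.empty) {
  def viewDefaultDatabase: Option[String] =
    properties.get(CatalogTable.VIEW_DEFAULT_DATABASE)
}

val view = CatalogTable(Map(CatalogTable.VIEW_DEFAULT_DATABASE -> "db1"))
val table = CatalogTable()
```

The accessor degrades gracefully to `None` for tables that never set the property, which matches the backward-compatibility note elsewhere in this thread about views created by older versions of Spark.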
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95305390 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} --- End diff -- Like [withWatermark](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L572), we can simplify the above code by changing the interface of `withAnalysisContext`: ```Scala val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) { execute(child) } ``` The `nestedViewLevel` can be incremented by one in `withAnalysisContext`.
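The simplified `withAnalysisContext` interface suggested above could look roughly like this sketch (a minimal `AnalysisContext` with a thread-local holder; the names mirror the PR, but the implementation details are assumptions): the helper takes only the default database, increments `nestedViewLevel` itself, and restores the previous context afterwards.

```scala
// Minimal per-thread analysis context (sketch, not Spark's actual code).
case class AnalysisContext(defaultDatabase: Option[String] = None, nestedViewLevel: Int = 0)

object AnalysisContext {
  private val value = new ThreadLocal[AnalysisContext] {
    override def initialValue(): AnalysisContext = AnalysisContext()
  }
  def get: AnalysisContext = value.get()

  // Swap in a child context (incrementing the nesting level here, so callers
  // don't have to), run the block, then restore the previous context.
  def withAnalysisContext[A](database: Option[String])(body: => A): A = {
    val original = get
    value.set(AnalysisContext(database, original.nestedViewLevel + 1))
    try body finally value.set(original)
  }
}

val levelInside = AnalysisContext.withAnalysisContext(Some("db2")) {
  AnalysisContext.get.nestedViewLevel
}
```

The `try/finally` restore makes nested calls compose: each level of view resolution sees its own default database and depth, and the outer context reappears when the block exits, even on exceptions.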
[GitHub] spark issue #12064: [SPARK-14272][ML] Evaluate GaussianMixtureModel with Log...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71113/testReport)** for PR 12064 at commit [`8c2d529`](https://github.com/apache/spark/commit/8c2d5291affd75081da8152cfb6a1afc7e44a656).
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95305125 --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala --- @@ -592,47 +579,6 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg assert(manager.resourceOffer("execB", "host2", RACK_LOCAL).get.index === 1) } - test("do not emit warning when serialized task is small") { --- End diff -- Ok, I will add this test case back.
[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...
Github user witgo commented on the issue: https://github.com/apache/spark/pull/15505 @squito In local mode, performance is relatively less important; we only guarantee that there will be no performance degradation.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304793 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. 
+ */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- how did we handle view schema mismatch before?
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304745 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`, `view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`, `view2`, `view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. 
+// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. +val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. 
Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDatabase(defaultDatabase)
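The three-step database choice described in the quoted comment can be sketched in isolation. This is a simplified model, not Spark's real API: `TableIdentifier`, `withDatabase`, `LookupSketch`, and `currentDb` below are hypothetical stand-ins for the catalyst classes the diff references.

```scala
// Sketch of the lookup-database precedence, under the assumption that
// `withDatabase` only fills in a missing database part (explicit wins).
case class TableIdentifier(table: String, database: Option[String] = None) {
  def withDatabase(db: Option[String]): TableIdentifier =
    if (database.isDefined) this else copy(database = db)
}

object LookupSketch {
  val currentDb = "default"  // stand-in for SessionCatalog's current database

  // 1. explicit database part; 2. else the view's defaultDatabase; 3. else currentDb.
  def resolveDb(ident: TableIdentifier, defaultDatabase: Option[String]): String =
    ident.database.orElse(defaultDatabase).getOrElse(currentDb)
}
```

With this precedence, a nested view carries its own `defaultDatabase` down to the relations it references, while an explicitly qualified table name is never overridden.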
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley @srowen Can anyone check my latest commit please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304509
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala ---
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+/**
+ * This file defines analysis rules related to views.
+ */
+
+/**
+ * Make sure that a view's child plan produces the view's output attributes. We wrap the child
+ * with a Project and add an alias for each output attribute. The attributes are resolved by
+ * name. This should only be done after the resolution batch, because the view attributes are
+ * not stable during resolution.
+ */
+case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+    case v @ View(_, output, child) if child.resolved =>
+      val resolver = conf.resolver
+      val newOutput = child.output.map { attr =>
+        val newAttr = findAttributeByName(attr.name, output, resolver)
+        // Check the dataType of the output attributes, throw an AnalysisException if they don't
+        // match up.
+        checkDataType(attr, newAttr)
+        Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier,
+          explicitMetadata = Some(newAttr.metadata))
+      }
+      v.copy(child = Project(newOutput, child))
+  }
+
+  /**
+   * Find the attribute that has the expected attribute name from an attribute list; the names
+   * are compared using conf.resolver.
+   * If the expected attribute is not found, throw an AnalysisException.
+   */
+  private def findAttributeByName(
+      name: String,
+      attrs: Seq[Attribute],
+      resolver: Resolver): Attribute = {
+    attrs.collectFirst {
+      case attr if resolver(attr.name, name) => attr
+    }.getOrElse(throw new AnalysisException(
+      s"Attribute with name '$name' is not found in " +
+        s"'${attrs.map(_.name).mkString("(", ",", ")")}'"))
+  }
+
+  /**
+   * Check whether the dataType of `attr` could be cast to that of `other`, throw an
+   * AnalysisException if the two attributes don't match up.
+   */
+  private def checkDataType(attr: Attribute, other: Attribute): Unit = {
+    if (!Cast.canCast(attr.dataType, other.dataType)) {
+      throw new AnalysisException(
+        s"The dataType of attribute '$other' is '${other.dataType}', which can't be cast to " +
+          s"that of '$attr', expected '${attr.dataType}'.")
+    }
+  }
+}
+
+/**
+ * Removes [[View]] operators from the plan. The operator is respected till the end of the
+ * analysis stage because we want to see which part of an analyzed logical plan is generated
+ * from a view.
+ */
+object EliminateView extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    // The child should have the same output attributes with the View operator, so we simply
--- End diff --
> The child should have the same output attributes with the View operator

shall we add an assert?
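The name-based matching that `AliasViewChild` performs can be illustrated with a stripped-down model. `Attr` and `AliasSketch` below are hypothetical stand-ins: the real rule also carries exprIds, qualifiers, and metadata, and throws `AnalysisException` rather than `IllegalArgumentException`.

```scala
// Stripped-down model of AliasViewChild's matching step: pair each child output
// attribute with the view output attribute of the same name, using a
// case-insensitive resolver (Spark's default).
case class Attr(name: String, dataType: String)

object AliasSketch {
  val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

  def findAttributeByName(name: String, attrs: Seq[Attr]): Attr =
    attrs.collectFirst { case a if resolver(a.name, name) => a }.getOrElse(
      throw new IllegalArgumentException(s"Attribute with name '$name' is not found"))

  // The pairs the real rule turns into Alias(childAttr, name)(viewAttr.exprId, ...).
  def aliasPairs(viewOutput: Seq[Attr], childOutput: Seq[Attr]): Seq[(Attr, Attr)] =
    childOutput.map(attr => attr -> findAttributeByName(attr.name, viewOutput))
}
```

Resolving by name rather than by position is what keeps a view stable when the underlying table's column order changes.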
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95304037
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
+        val defaultDatabase = AnalysisContext.get.defaultDatabase
+        val relation = lookupTableFromCatalog(u, defaultDatabase)
+        resolveRelation(relation, defaultDatabase)
+      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
+      // `viewText` should be defined, or else we throw an error on the generation of the View
+      // operator.
+      case view @ View(desc, _, child) if !child.resolved =>
+        val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
+        val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
+          nestedViewLevel = nestedViewLevel)
+        // Resolve all the UnresolvedRelations and Views in the child.
+        val newChild = AnalysisContext.withAnalysisContext(context) {
+          execute(child)
+        }
+        view.copy(child = newChild)
+      case p @ SubqueryAlias(_, view: View, _) =>
--- End diff --
why we have this case?
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95304052
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -149,7 +149,12 @@ private[spark] object Utils extends Logging {
   /** Deserialize an object using Java serialization and the given ClassLoader */
   def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
-    val bis = new ByteArrayInputStream(bytes)
+    deserialize(ByteBuffer.wrap(bytes), loader)
+  }
+
+  /** Deserialize an object using Java serialization and the given ClassLoader */
+  def deserialize[T](bytes: ByteBuffer, loader: ClassLoader): T = {
+    val bis = new ByteBufferInputStream(bytes)
--- End diff --
I don't think there's a problem here. It is covered by many test cases.
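The point of the new overload is that a `ByteBuffer` can be deserialized without first copying it into a byte array. Below is a JDK-only sketch of the same idea: the anonymous `InputStream` plays the role of Spark's `ByteBufferInputStream`, and the `ClassLoader` parameter that the real `Utils.deserialize` threads through is omitted for brevity.

```scala
import java.io.{ByteArrayOutputStream, InputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object SerSketch {
  // Wrap a ByteBuffer as an InputStream without copying its contents,
  // similar in spirit to Spark's ByteBufferInputStream.
  def stream(buf: ByteBuffer): InputStream = new InputStream {
    override def read(): Int = if (buf.hasRemaining) buf.get() & 0xFF else -1
  }

  def serialize(o: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(o)
    oos.close()
    bos.toByteArray
  }

  // Java deserialization driven directly by the buffer's position.
  def deserialize[T](bytes: ByteBuffer): T = {
    val ois = new ObjectInputStream(stream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```

A round trip through `ByteBuffer.wrap` then behaves exactly like the byte-array overload, which is why delegating the old method to the new one is safe.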
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95303945
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+      case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
+        val defaultDatabase = AnalysisContext.get.defaultDatabase
+        val relation = lookupTableFromCatalog(u, defaultDatabase)
+        resolveRelation(relation, defaultDatabase)
+      // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
+      // `viewText` should be defined, or else we throw an error on the generation of the View
+      // operator.
+      case view @ View(desc, _, child) if !child.resolved =>
+        val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
+        val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
+          nestedViewLevel = nestedViewLevel)
+        // Resolve all the UnresolvedRelations and Views in the child.
+        val newChild = AnalysisContext.withAnalysisContext(context) {
+          execute(child)
+        }
+        view.copy(child = newChild)
+      case p @ SubqueryAlias(_, view: View, _) =>
+        val newChild = resolveRelation(view, defaultDatabase)
+        p.copy(child = newChild)
+      case _ => plan
+    }
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+      case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
+        i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+      case u: UnresolvedRelation => resolveRelation(u)
+    }
+
+    // Look up the table with the given name from catalog. The database we look up the table
+    // from is decided by the following steps:
+    // 1. If the database part is defined in the table identifier, use that database name;
+    // 2. Else if the defaultDatabase is defined, use the default database name (in this case,
+    //    no temporary objects can be used, and the default database is only used to look up a
+    //    view);
+    // 3. Else use the currentDb of the SessionCatalog.
--- End diff --
Looks like the current code doesn't match these comments well, shall we update the comments?
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95303491 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala --- @@ -1169,26 +1169,6 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv } - test("save API - format hive") { --- End diff -- already covered by https://github.com/apache/spark/pull/16487/files#diff-b7094baa12601424a5d19cb930e3402fR1356
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71112 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71112/testReport)** for PR 16487 at commit [`27d97e5`](https://github.com/apache/spark/commit/27d97e56295465bb717da9ff08d2571998adf2b1).
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303304
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
--- End diff --
Yes, it would be wrong to use the default value of 1. Take a data set of:

    Age, Count
    20, 1
    15, 1
    10, 0

If we take the default value of 1L when the frequency is 0, then the .5 percentile would become 15. This is incorrect. I agree with the other suggestion of either failing or disregarding those values.
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95303089
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala ---
@@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR
       }
     }
   """
-}.mkString("\n")
-comparisons
+}
+
+ctx.splitExpressions(
+  expressions = comparisons,
+  funcName = "compare",
+  arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")),
+  returnType = "int",
+  makeSplitFunction = { body =>
+    s"""
+      InternalRow ${ctx.INPUT_ROW} = null;  // Holds current row being evaluated.
+      $body
+      return 0;
+    """
+  },
+  foldFunctions = { funCalls =>
+    val comp = ctx.freshName("comp")
+    funCalls.zipWithIndex.map { case (funCall, i) =>
+      s"""
+        int ${comp}_$i = $funCall;
--- End diff --
can you double check? the implementation of `freshName` is
```
if (freshNameIds.contains(fullName)) {
  val id = freshNameIds(fullName)
  freshNameIds(fullName) = id + 1
  s"$fullName$id"
} else {
  freshNameIds += fullName -> 1
  fullName
}
```
it already adds an id postfix.
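The quoted `freshName` logic can be run in isolation to confirm cloud-fan's point that it already appends an id suffix on collisions. This is a sketch using a plain mutable map in place of `CodegenContext`'s internal state:

```scala
import scala.collection.mutable

object FreshNames {
  private val freshNameIds = mutable.Map.empty[String, Int]

  // Mirrors the quoted implementation: the first use of a name returns it
  // bare; later uses return name1, name2, ... so generated names never collide.
  def freshName(fullName: String): String =
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
}
```

So the extra `_$i` suffix in the diff is defensive rather than strictly required for uniqueness within one codegen context.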
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303054
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
+    val frqLong = frqValue.asInstanceOf[Number].longValue()
+    // add only when frequency is positive
+    if (frqLong > 0) {
--- End diff --
I think the option was between either failing or disregarding those values. We can certainly make this a requirement, document it, and fail when the values are negative. For the cases where values are either null or 0, I think we should not add them to the map, to avoid unnecessarily bloating it. The logic would look like:

    if (frqLong < 0) {
      throw new SomeException
    } else if (frqLong > 0) {
      // process to add them to map
    }

Let me know if the above looks good and I will make the changes accordingly.
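The handling proposed in this exchange — fail on negative frequencies, skip zero (and null) ones so they never bloat the counts map — can be sketched as follows. `FreqSketch.update` is a hypothetical simplification of `Percentile.update`, with a plain mutable map standing in for Spark's `OpenHashMap` and `IllegalArgumentException` for whatever exception the final patch throws:

```scala
import scala.collection.mutable

object FreqSketch {
  // Add a (value, frequency) pair to the counts map: negative frequencies are
  // an error, zero frequencies contribute nothing and are not stored at all.
  def update(buffer: mutable.Map[Double, Long], key: Double, frq: Long): Unit =
    if (frq < 0) {
      throw new IllegalArgumentException(s"negative frequency: $frq")
    } else if (frq > 0) {
      buffer.put(key, buffer.getOrElse(key, 0L) + frq)
    }
}
```

On the thread's Age/Count example (20→1, 15→1, 10→0), the zero-frequency row for age 10 is simply absent from the map, so it cannot drag the median down.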
[GitHub] spark pull request #15413: [SPARK-17847][ML] Reduce shuffled data size of Ga...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15413
[GitHub] spark issue #15413: [SPARK-17847][ML] Reduce shuffled data size of GaussianM...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/15413 Merged into master. Thanks for all reviews.
[GitHub] spark pull request #15413: [SPARK-17847][ML] Reduce shuffled data size of Ga...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15413#discussion_r95302720
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala ---
@@ -126,9 +143,93 @@ class GaussianMixtureSuite extends SparkFunSuite with MLlibTestSparkContext
     testEstimatorAndModelReadWrite(gm, dataset, GaussianMixtureSuite.allParamSettings, checkModelData)
   }
+
+  test("univariate dense/sparse data with two clusters") {
+    val weights = Array(2.0 / 3.0, 1.0 / 3.0)
+    val means = Array(Vectors.dense(5.1604), Vectors.dense(-4.3673))
+    val covs = Array(Matrices.dense(1, 1, Array(0.86644)), Matrices.dense(1, 1, Array(1.1098)))
+    val gaussians = means.zip(covs).map { case (mean, cov) =>
+      new MultivariateGaussian(mean, cov)
+    }
+    val expected = new GaussianMixtureModel("dummy", weights, gaussians)
+
+    Seq(denseDataset, sparseDataset).foreach { dataset =>
+      val actual = new GaussianMixture().setK(2).setSeed(seed).fit(dataset)
+      modelEquals(expected, actual)
+    }
+  }
+
+  test("check distributed decomposition") {
--- End diff --
Yeah, I also suffer from bad initialization in some of my use cases. So I think we should push to commit [SPARK-15785](https://issues.apache.org/jira/browse/SPARK-15785) first. It's easier to add a correctness test after we support an initial model. I'll leave this as a follow-up and open [SPARK-19144](https://issues.apache.org/jira/browse/SPARK-19144) to track it. Thanks.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16233 Sorry, I have not finished the review. Need to buy milk for my baby. Will continue the review soon. : )
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95302345
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -50,6 +50,36 @@ object SimpleAnalyzer extends Analyzer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))

/**
+ * Provides a way to keep state during the analysis, this enables us to decouple the concerns
+ * of analysis environment from the catalog.
+ *
+ * Note this is thread local.
+ *
+ * @param defaultDatabase The default database used in the view resolution, this overrules the
+ *                        current catalog database.
+ * @param nestedViewLevel The nested level in the view resolution, this enables us to limit the
+ *                        depth of nested views.
+ */
+case class AnalysisContext(
+    defaultDatabase: Option[String] = None,
+    nestedViewLevel: Int = 0)
+
+object AnalysisContext {
+  private val value = new ThreadLocal[AnalysisContext]() {
+    override def initialValue: AnalysisContext = AnalysisContext()
+  }
+
+  def get: AnalysisContext = value.get()
+  def set(context: AnalysisContext): Unit = value.set(context)
--- End diff --
This function is private, right?
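`set` indeed only needs to be visible to a scoping helper that installs a context, runs a block, and restores the previous context, which is why making it private is safe. A minimal sketch of that pattern follows; `Ctx` and `withContext` are stand-ins for the real `AnalysisContext` and its `withAnalysisContext` helper, which carry the same two fields:

```scala
case class Ctx(defaultDatabase: Option[String] = None, nestedViewLevel: Int = 0)

object Ctx {
  // One context per thread; analysis of a plan stays on a single thread.
  private val value = new ThreadLocal[Ctx]() {
    override def initialValue: Ctx = Ctx()
  }

  def get: Ctx = value.get()
  private def set(context: Ctx): Unit = value.set(context)

  // Install `context` for the duration of `body`, then restore the previous
  // context; this is the only caller of `set`, so `set` can be private.
  def withContext[A](context: Ctx)(body: => A): A = {
    val originContext = get
    set(context)
    try body finally set(originContext)
  }
}
```

The try/finally restore is what makes nested view resolution work: each level sees its own default database, and the outer level's context reappears when the block exits (even on error).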
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95302084
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -510,32 +542,90 @@ class Analyzer(
   * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
   */
  object ResolveRelations extends Rule[LogicalPlan] {
-    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
+
+    // If the unresolved relation is running directly on files, we just return the original
+    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
+    // and change the default database name if it is a view.
+    // We usually look up a table from the default database if the table identifier has an empty
+    // database part; for a view the default database should be the currentDb when the view was
+    // created. When it comes to resolving a nested view, the view may have a different default
+    // database than the view it references, so we need to use the variable `defaultDatabase`
+    // to track the current default database.
+    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable),
+    // set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, and
+    // look up the relations that the view references using the default database.
+    // For example:
+    // |- view1 (defaultDatabase = db1)
+    //   |- operator
+    //     |- table2 (defaultDatabase = db1)
+    //     |- view2 (defaultDatabase = db2)
+    //        |- view3 (defaultDatabase = db3)
+    //     |- view4 (defaultDatabase = db4)
+    // In this case, the view `view1` is a nested view that directly references `table2`, `view2`
+    // and `view4`; the view `view2` references `view3`. On resolving the table, we look up the
+    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
+    // relation `view3` using the default database `db2`.
+    //
+    // Note this is compatible with the views defined by older versions of Spark (before 2.2),
+    // which have an empty defaultDatabase and all the relations in viewText have the database
+    // part defined.
+    def resolveRelation(
+        plan: LogicalPlan,
+        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
--- End diff --
We already introduce `AnalysisContext`. It does not need this param, right?
```Scala
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
  case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
    val defaultDatabase = AnalysisContext.get.defaultDatabase
    val relation = lookupTableFromCatalog(u, defaultDatabase)
    resolveRelation(relation)
  // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
  // `viewText` should be defined, or else we throw an error on the generation of the View
  // operator.
  case view @ View(desc, _, child) if !child.resolved =>
    val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1
    val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase,
      nestedViewLevel = nestedViewLevel)
    // Resolve all the UnresolvedRelations and Views in the child.
    val newChild = AnalysisContext.withAnalysisContext(context) {
      execute(child)
    }
    view.copy(child = newChild)
  case p @ SubqueryAlias(_, view: View, _) =>
    val newChild = resolveRelation(view)
    p.copy(child = newChild)
  case _ => plan
}
```
Please also update the function comment.
[GitHub] spark issue #16465: [SPARK-19064][PySpark]Fix pip installing of sub componen...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16465 Gentle ping for @joshrosen, or @davies maybe?
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Merged build finished. Test FAILed.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487 **[Test build #71110 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71110/testReport)** for PR 16487 at commit [`6209d04`](https://github.com/apache/spark/commit/6209d04709aee523b8fd43652fbcc87c8908e168). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16487 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71110/ Test FAILed.
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/14725 I don't think the wrappers are public APIs per se, but I agree reducing the amount of boilerplate scala code required to expose the ML stuff is good if we can make it robust :)
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300754 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/OrderingSuite.scala --- @@ -127,4 +127,17 @@ class OrderingSuite extends SparkFunSuite with ExpressionEvalHelper { } } } + + test("SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB") { +val sortOrder = Literal("abc").asc + +// this is passing prior to SPARK-16845, and it should also be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(40)(sortOrder)) + +// this is FAILING prior to SPARK-16845, but it should be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(450)(sortOrder)) --- End diff -- Sure, I'll remove the `450` test case
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300735 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- It's `comp_0`, `comp_1` in the following: ```scala /* 15388 */ public int compare(InternalRow a, InternalRow b) { /* 15389 */ /* 15390 */ int comp_0 = compare_0(a, b); /* 15391 */ if (comp_0 != 0) { /* 15392 */ return comp_0; /* 15393 */ } /* 15394 */ /* 15395 */ int comp_1 = compare_1(a, b); /* 15396 */ if (comp_1 != 0) { /* 15397 */ return comp_1; /* 15398 */ } /* 1 */ /* 1 */ ... /* 1 */ /* 15450 */ return 0; /* 15451 */ } ``` so maybe we should keep this `_$i`?
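The folding shown in the generated `compare_0`/`compare_1` snippet above — call each split function in turn and short-circuit on the first nonzero result — can be sketched outside codegen. This is an illustrative Python sketch of the idea, not Spark's generated Java:

```python
def fold_comparators(comparators):
    """Combine per-chunk comparison functions into one ordering: call each in
    turn and return the first nonzero result, mirroring how the generated
    compare_0, compare_1, ... calls are folded in the snippet above."""
    def compare(a, b):
        for comp in comparators:
            r = comp(a, b)
            if r != 0:
                return r
        return 0
    return compare

def column_comparator(i):
    # Three-way comparison on column i: returns -1, 0, or 1.
    return lambda a, b: (a[i] > b[i]) - (a[i] < b[i])

# A wide sort order is split into chunks so that no single generated method
# exceeds the JVM's 64 KB bytecode limit (the bug SPARK-16845 fixes); here
# each "chunk" is just one column comparator for simplicity.
ordering = fold_comparators([column_comparator(i) for i in range(3)])
```

Splitting into named functions and folding their results keeps each method small while preserving the left-to-right, short-circuiting semantics of a multi-column ordering.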
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16514 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71109/ Test PASSed.
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16514 Merged build finished. Test PASSed.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95300580 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -50,6 +50,36 @@ object SimpleAnalyzer extends Analyzer( new SimpleCatalystConf(caseSensitiveAnalysis = true)) /** + * Provides a way to keep state during the analysis, this enables us to decouple the concerns + * of analysis environment from the catalog. + * + * Note this is thread local. + * + * @param defaultDatabase The default database used in the view resolution, this overrules the + *current catalog database. + * @param nestedViewLevel The nested level in the view resolution, this enables us to limit the + *depth of nested views. --- End diff -- This is not being used, right? If so, could you update the comment?
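The thread-local analysis-context pattern discussed above — push the view's own default database while resolving its children, then restore the enclosing context — can be sketched as follows. Python is used for brevity; the class and method names echo the review discussion but are illustrative assumptions, not Spark's actual API:

```python
import threading

class AnalysisContext:
    """Per-thread stack of analysis state: the default database used for
    view resolution and the current view-nesting depth."""
    _local = threading.local()

    def __init__(self, default_database=None, nested_view_level=0):
        self.default_database = default_database
        self.nested_view_level = nested_view_level

    @classmethod
    def get(cls):
        # Lazily create a per-thread stack with a root (empty) context.
        if not getattr(cls._local, "stack", None):
            cls._local.stack = [cls()]
        return cls._local.stack[-1]

    @classmethod
    def with_context(cls, context, body):
        # Push the view's context, run the resolution body, and restore the
        # enclosing context even if resolution raises.
        cls.get()  # ensure the stack exists
        cls._local.stack.append(context)
        try:
            return body()
        finally:
            cls._local.stack.pop()

# Resolving a nested view: the child plan is analyzed under the view's own
# default database; afterwards the outer context is back in effect.
seen = AnalysisContext.with_context(
    AnalysisContext(default_database="db2", nested_view_level=1),
    lambda: AnalysisContext.get().default_database)
```

Because the state is thread-local and stack-shaped, concurrent analyses do not interfere, and nested views each see exactly the default database of the view that references them.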
[GitHub] spark issue #16514: [SPARK-19128] [SQL] Refresh Cache after Set Location
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16514 **[Test build #71109 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71109/testReport)** for PR 16514 at commit [`0805cf6`](https://github.com/apache/spark/commit/0805cf6300953ac813ca9d10b76948a42af02a95). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16255#discussion_r95300466 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] { case plan: Project if plan eq proj => plan.child case plan => plan transformExpressions { case a: Attribute if attrMap.contains(a) => attrMap(a) + case b: Alias if attrMap.exists(_._1.exprId == b.exprId) +&& b.child.isInstanceOf[NamedExpression] => b.child --- End diff -- I don't get it, for `alias a#1 to a#2`, we wanna `replace all a#2 with a#1`, so we will do nothing for `alias a#1 to a#2`, because we can't find an attribute `a#2`
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r95300097 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/OrderingSuite.scala --- @@ -127,4 +127,17 @@ class OrderingSuite extends SparkFunSuite with ExpressionEvalHelper { } } } + + test("SPARK-16845: GeneratedClass$SpecificOrdering grows beyond 64 KB") { +val sortOrder = Literal("abc").asc + +// this is passing prior to SPARK-16845, and it should also be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(40)(sortOrder)) + +// this is FAILING prior to SPARK-16845, but it should be passing after SPARK-16845 +GenerateOrdering.generate(Array.fill(450)(sortOrder)) --- End diff -- This is unnecessary, it's covered by the `5000` test case.
[GitHub] spark issue #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrdering` gro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15480 LGTM
[GitHub] spark pull request #15480: [SPARK-16845][SQL] `GeneratedClass$SpecificOrderi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15480#discussion_r9536 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -117,8 +117,31 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR } } """ -}.mkString("\n") -comparisons +} + +ctx.splitExpressions( + expressions = comparisons, + funcName = "compare", + arguments = Seq(("InternalRow", "a"), ("InternalRow", "b")), + returnType = "int", + makeSplitFunction = { body => +s""" + InternalRow ${ctx.INPUT_ROW} = null; // Holds current row being evaluated. + $body + return 0; +""" + }, + foldFunctions = { funCalls => +val comp = ctx.freshName("comp") +funCalls.zipWithIndex.map { case (funCall, i) => + s""" +int ${comp}_$i = $funCall; --- End diff -- nit: `ctx.freshName` already adds postfix to the name, you don't need to add `_$i` again.
[GitHub] spark pull request #16481: [SPARK-19092] [SQL] Save() API of DataFrameWriter...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16481#discussion_r95299609 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -413,17 +413,22 @@ case class DataSource( relation } - /** Writes the given [[DataFrame]] out to this [[DataSource]]. */ + /** + * Writes the given [[DataFrame]] out to this [[DataSource]]. + * + * @param isForWriteOnly Whether to just write the data without returning a [[BaseRelation]]. + */ def write( mode: SaveMode, - data: DataFrame): BaseRelation = { + data: DataFrame, + isForWriteOnly: Boolean = false): Option[BaseRelation] = { if (data.schema.map(_.dataType).exists(_.isInstanceOf[CalendarIntervalType])) { throw new AnalysisException("Cannot save interval data type into external storage.") } providingClass.newInstance() match { case dataSource: CreatableRelationProvider => -dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data) +Some(dataSource.createRelation(sparkSession.sqlContext, mode, caseInsensitiveOptions, data)) --- End diff -- it would be really weird if `CreatableRelationProvider.createRelation` can return a relation with different schema from the written `data`. Is it safe to assume the schema won't change? cc @marmbrus @yhuai @liancheng
[GitHub] spark issue #16500: [SPARK-19120] [SPARK-19121] Refresh Metadata Cache After...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16500 I'm wondering if we need the metadata cache anymore. Now we store partitions in the metastore, and have a cache for leaf files, what's the benefit of metadata cache?
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95298994 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { // Only do the check if the table is a data source table (the relation is a BaseRelation). + // TODO(cloud-fan): also check hive table relation here when we support overwrite mode + // for creating hive tables. --- End diff -- We should not block it. This generates `InsertIntoTable`, and it supports hive table. What we should block is `saveAsTable` with `Overwrite` mode, which generates `CreateTable`. `insert overwrite` is different from `create table with overwrite mode`
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16517 **[Test build #7 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/7/testReport)** for PR 16517 at commit [`36c9269`](https://github.com/apache/spark/commit/36c9269c45b4bdca3be36a4e73bf48f8e273a9a3).
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r95298791 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala --- @@ -1,356 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.spark.sql.hive - -import java.text.NumberFormat -import java.util.{Date, Locale} - -import scala.collection.JavaConverters._ - -import org.apache.hadoop.fs.Path -import org.apache.hadoop.hive.common.FileUtils -import org.apache.hadoop.hive.conf.HiveConf.ConfVars -import org.apache.hadoop.hive.ql.exec.{FileSinkOperator, Utilities} -import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} -import org.apache.hadoop.hive.ql.plan.TableDesc -import org.apache.hadoop.hive.serde2.Serializer -import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} -import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption -import org.apache.hadoop.io.Writable -import org.apache.hadoop.mapred._ -import org.apache.hadoop.mapreduce.TaskType - -import org.apache.spark._ -import org.apache.spark.internal.Logging -import org.apache.spark.internal.io.SparkHadoopWriterUtils -import org.apache.spark.mapred.SparkHadoopMapRedUtil -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.execution.UnsafeKVExternalSorter -import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} -import org.apache.spark.sql.types._ -import org.apache.spark.util.SerializableJobConf -import org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter - -/** - * Internal helper class that saves an RDD using a Hive OutputFormat. - * It is based on `SparkHadoopWriter`. 
- */ -private[hive] class SparkHiveWriterContainer( -@transient private val jobConf: JobConf, -fileSinkConf: FileSinkDesc, -inputSchema: Seq[Attribute]) - extends Logging - with HiveInspectors - with Serializable { - - private val now = new Date() - private val tableDesc: TableDesc = fileSinkConf.getTableInfo - // Add table properties from storage handler to jobConf, so any custom storage - // handler settings can be set to jobConf - if (tableDesc != null) { -HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, jobConf, false) -Utilities.copyTableJobPropertiesToConf(tableDesc, jobConf) - } - protected val conf = new SerializableJobConf(jobConf) - - private var jobID = 0 - private var splitID = 0 - private var attemptID = 0 - private var jID: SerializableWritable[JobID] = null - private var taID: SerializableWritable[TaskAttemptID] = null - - @transient private var writer: FileSinkOperator.RecordWriter = null - @transient protected lazy val committer = conf.value.getOutputCommitter - @transient protected lazy val jobContext = new JobContextImpl(conf.value, jID.value) - @transient private lazy val taskContext = new TaskAttemptContextImpl(conf.value, taID.value) - @transient private lazy val outputFormat = -conf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] - - def driverSideSetup() { -setIDs(0, 0, 0) -setConfParams() -committer.setupJob(jobContext) - } - - def executorSideSetup(jobId: Int, splitId: Int, attemptId: Int) { -setIDs(jobId, splitId, attemptId) -setConfParams() -committer.setupTask(taskContext) -initWriters() - } - - protected def getOutputName: String = { -val numberFormat = NumberFormat.getInstance(Locale.US) -numberFormat.setMinimumIntegerDigits(5) -numberFormat.setGroupingUsed(false) -val extension = Utilities.getFileExtension(conf.value, fileSinkConf.getCompressed, outputFormat)
[GitHub] spark issue #14725: [SPARK-17161] [PYSPARK][ML] Add PySpark-ML JavaWrapper c...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/14725 Sure, I can add a better docstring. This is just for developers and doesn't have to be used, but it can be used to avoid creating more Java-friendly functions only because they have arrays - which happens a lot in mllib. Anything to help limit public APIs is very useful, imho.
[GitHub] spark pull request #16487: [SPARK-19107][SQL] support creating hive table wi...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16487#discussion_r95298747 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -385,6 +380,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { } EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { // Only do the check if the table is a data source table (the relation is a BaseRelation). + // TODO(cloud-fan): also check hive table relation here when we support overwrite mode + // for creating hive tables. --- End diff -- Although we ignore the specified provider, we still respect the actual format of the table. For example, below is the Hive table. We are not blocking it. Should we block it to make them consistent? ```Scala sql(s"CREATE TABLE $tableName STORED AS SEQUENCEFILE AS SELECT 1 AS key, 'abc' AS value") val df = sql(s"SELECT key, value FROM $tableName") df.write.mode("overwrite").insertInto(tableName) ```
[GitHub] spark pull request #15505: [SPARK-18890][CORE] Move task serialization from ...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/15505#discussion_r95298151 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala --- @@ -52,7 +55,36 @@ private[spark] class TaskDescription( val addedFiles: Map[String, Long], val addedJars: Map[String, Long], val properties: Properties, -val serializedTask: ByteBuffer) { +private var serializedTask_ : ByteBuffer, +private var task_ : Task[_] = null) extends Logging { --- End diff -- @squito I agree with you, I will modify it. @kayousterhout The main reason for this is to reduce the amount of code changes, we need to change more code files to implement what you say.
[GitHub] spark issue #15211: [SPARK-14709][ML] spark.ml API for linear SVM
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/15211 Sent an update to include a R unit test. Yet I met a problem that there's a constant scaling difference between LinearSVC and R e1071 (which essentially is LibSVM). It's possible that it's caused by some parameter setting. Post it anyway to see if there's any suggestions. Sorry @zhengruifeng, I'll address your comment in the next update.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297133 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value array of numeric column `col` at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. -""") + + _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric + column `col` with frequency column `frequency` at the given percentage. The value of + percentage must be between 0.0 and 1.0. + + _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile + value array of numeric column `col` with frequency column `frequency` at the given + percentage(s). Each value of the percentage array must be between 0.0 and 1.0. + + """) case class Percentile( child: Expression, +frequency : Expression, percentageExpression: Expression, +withFrqExpr : Boolean, --- End diff -- Please remove withFrqExpr. The frequency must be provided, and should default to 1L.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297232 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala --- @@ -81,7 +96,11 @@ case class Percentile( case arrayData: ArrayData => arrayData.toDoubleArray().toSeq } - override def children: Seq[Expression] = child :: percentageExpression :: Nil + override def children: Seq[Expression] = if (withFrqExpr) { --- End diff -- Why do we need to make a difference here? Just do: ```scala override def children: Seq[Expression] = { child :: frequency :: percentageExpression :: Nil } ```
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297644
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -91,9 +110,16 @@ case class Percentile(
     case _ => DoubleType
   }
-  override def inputTypes: Seq[AbstractDataType] = percentageExpression.dataType match {
-    case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-    case _ => Seq(NumericType, DoubleType)
+  override def inputTypes: Seq[AbstractDataType] = {
+    val percentageExpType = percentageExpression.dataType match {
+      case _: ArrayType => ArrayType(DoubleType)
+      case _ => DoubleType
+    }
+    if (withFrqExpr) {
--- End diff --
Again, remove `withFrqExpr`.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297633
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
+    val frqLong = frqValue.asInstanceOf[Number].longValue()
+    // add only when frequency is positive
+    if (frqLong > 0) {
--- End diff --
Let's make this a requirement and fail when the value is < 0.
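Under this proposal, the update path would reject negative frequencies outright rather than silently dropping those rows. A self-contained sketch of that behaviour, using a plain mutable map in place of Spark's `OpenHashMap` (the names and signatures here are illustrative, not Spark's actual code):

```scala
import scala.collection.mutable

object UpdateSketch extends App {
  // Accumulate value -> total frequency, failing fast on negative frequencies.
  // Option stands in for nullable column values in this sketch.
  def update(buffer: mutable.Map[Double, Long],
             key: Option[Double],
             frq: Option[Long]): Unit = {
    for (k <- key; f <- frq) {
      require(f >= 0, s"Negative frequency $f found for key $k")
      if (f > 0) buffer(k) = buffer.getOrElse(k, 0L) + f
    }
  }

  val buffer = mutable.Map.empty[Double, Long]
  update(buffer, Some(1.0), Some(3L))
  update(buffer, Some(1.0), Some(2L))
  update(buffer, Some(2.0), None)      // null frequency: row is skipped
  update(buffer, Some(3.0), Some(0L))  // zero frequency: contributes nothing
  println(buffer)
}
```

The difference from the diff under review is only the `require`: a frequency of zero still contributes nothing, but a negative frequency now raises an error instead of being quietly ignored.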
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95297609
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
   // Null values are ignored in counts map.
-  if (key != null) {
-    buffer.changeValue(key, 1L, _ + 1L)
+  if (key != null && frqValue != null) {
--- End diff --
I don't think we should default to 1L as a value. That seems more wrong than either failing or skipping.
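For context on what the accumulated counts feed into: a frequency-weighted exact percentile can be computed from the (value -> count) map by linear interpolation between neighbouring ranks. The following is a simplified sketch of that idea, not Spark's actual implementation:

```scala
object PercentileCalc extends App {
  // Exact percentile over a value -> frequency map. The target rank is
  // p * (total - 1); when it is fractional, interpolate linearly between
  // the values at the two neighbouring integer ranks.
  def percentile(counts: Map[Double, Long], p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, s"percentage $p must be between 0.0 and 1.0")
    val sorted = counts.toSeq.sortBy(_._1)
    val total = sorted.map(_._2).sum
    val pos = p * (total - 1)

    // Value at 0-based rank n, walking the cumulative frequencies in order.
    def valueAt(n: Long): Double = {
      var cum = 0L
      sorted.find { case (_, c) => cum += c; cum > n }.get._1
    }

    val (lo, hi) = (math.floor(pos).toLong, math.ceil(pos).toLong)
    val (lv, hv) = (valueAt(lo), valueAt(hi))
    lv + (hv - lv) * (pos - lo)
  }

  // Median of 1,2,3,4 (each with frequency 1) interpolates between 2 and 3.
  println(percentile(Map(1.0 -> 1L, 2.0 -> 1L, 3.0 -> 1L, 4.0 -> 1L), 0.5))
}
```

This also shows why the null/default debate matters: a frequency defaulted to 1L would shift the cumulative counts and hence the result, whereas skipping or failing leaves the semantics unambiguous.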
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16487
**[Test build #71110 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71110/testReport)** for PR 16487 at commit [`6209d04`](https://github.com/apache/spark/commit/6209d04709aee523b8fd43652fbcc87c8908e168).
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95295321
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
     buffer: OpenHashMap[Number, Long],
     input: InternalRow): OpenHashMap[Number, Long] = {
   val key = child.eval(input).asInstanceOf[Number]
+  val frqValue = frequency.eval(input)
--- End diff --
We should check the value of the `frequencyExpression` beforehand.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95294516
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
       _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value
      array of numeric column `col` at the given percentage(s). Each value of the percentage
      array must be between 0.0 and 1.0.
-      """)
+
+      _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric
+      column `col` with frequency column `frequency` at the given percentage. The value of
+      percentage must be between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile
+      value array of numeric column `col` with frequency column `frequency` at the given
+      percentage(s). Each value of the percentage array must be between 0.0 and 1.0.
+
+  """)
 case class Percentile(
     child: Expression,
+    frequency : Expression,
--- End diff --
Rename this to `frequencyExpression`, and place it after `percentageExpression`.