[GitHub] spark pull request #16641: Merge pull request #1 from apache/master
Github user someorz closed the pull request at: https://github.com/apache/spark/pull/16641 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16641: Merge pull request #1 from apache/master
GitHub user someorz opened a pull request: https://github.com/apache/spark/pull/16641 Merge pull request #1 from apache/master update ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/someorz/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16641.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16641 commit 65c6538a24b42fd4de934623553155ada76125e7 Author: someorz <24164...@qq.com> Date: 2016-10-17T07:52:29Z Merge pull request #1 from apache/master update
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16344 **[Test build #71643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71643/testReport)** for PR 16344 at commit [`83deee3`](https://github.com/apache/spark/commit/83deee352c46ec113554fccee4bdc14ead56072e).
[GitHub] spark pull request #16640: Merge pull request #1 from apache/master
Github user someorz closed the pull request at: https://github.com/apache/spark/pull/16640
[GitHub] spark issue #16640: Merge pull request #1 from apache/master
Github user someorz commented on the issue: https://github.com/apache/spark/pull/16640 update
[GitHub] spark pull request #16640: Merge pull request #1 from apache/master
GitHub user someorz opened a pull request: https://github.com/apache/spark/pull/16640 Merge pull request #1 from apache/master update ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/someorz/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16640.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16640 commit 65c6538a24b42fd4de934623553155ada76125e7 Author: someorz <24164...@qq.com> Date: 2016-10-17T07:52:29Z Merge pull request #1 from apache/master update
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16593 @windpiger Could you do me a favor and add a dedicated test case in this PR?
- Create a partitioned Hive table
- Create a partitioned data source table
- Create a partitioned Hive table AS SELECT
- Create a partitioned data source table AS SELECT
I want to see whether all of them follow the same rule:
- data columns + partition columns
- the order of data columns is based on the user-specified order in either the schema (CT) or the query (CTAS)
- the order of partition columns is based on the order of columns specified in the `PARTITIONED BY` clause
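The four cases above could be sketched roughly as follows. This is a minimal sketch, not code from the PR: the table names, the `src` source table, and the exact DDL Spark accepts for partitioned Hive CTAS at this point are all assumptions.

```scala
// Sketch of the four creation paths. Under the rule being proposed, all
// four should end up with schema (a, b, p2, p1): data columns in
// user/query order, then partition columns in PARTITIONED BY order.
sql("CREATE TABLE hive_ct (a INT, b STRING) PARTITIONED BY (p2 INT, p1 INT) STORED AS PARQUET")
sql("CREATE TABLE ds_ct (a INT, b STRING, p2 INT, p1 INT) USING parquet PARTITIONED BY (p2, p1)")
sql("CREATE TABLE hive_ctas PARTITIONED BY (p2, p1) STORED AS PARQUET AS SELECT a, b, p2, p1 FROM src")
sql("CREATE TABLE ds_ctas USING parquet PARTITIONED BY (p2, p1) AS SELECT a, b, p2, p1 FROM src")

// Check that all four follow the same column-ordering rule.
Seq("hive_ct", "ds_ct", "hive_ctas", "ds_ctas").foreach { t =>
  assert(spark.table(t).schema.fieldNames.toSeq == Seq("a", "b", "p2", "p1"))
}
```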
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins test this please
[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins add to whitelist
[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96805975 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -87,8 +101,8 @@ case class CreateHiveTableAsSelectCommand( } } else { try { -sparkSession.sessionState.executePlan(InsertIntoTable( - metastoreRelation, Map(), query, overwrite = true, ifNotExists = false)).toRdd + sparkSession.sessionState.executePlan(InsertIntoTable(metastoreRelation, --- End diff -- Yeah. Agree
[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96805893 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -64,7 +77,7 @@ case class CreateHiveTableAsSelectCommand( val withSchema = if (withFormat.schema.isEmpty) { // Hive doesn't support specifying the column list for target table in CTAS // However we don't think SparkSQL should follow that. --- End diff -- We need to update the above comment.
[GitHub] spark issue #15415: [SPARK-14503][ML] spark.ml API for FPGrowth
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15415 **[Test build #71642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71642/testReport)** for PR 15415 at commit [`3273b76`](https://github.com/apache/spark/commit/3273b76c3d818636a822f98ecd3df0706a4cae26).
[GitHub] spark issue #12064: [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71641/testReport)** for PR 12064 at commit [`cbec946`](https://github.com/apache/spark/commit/cbec946583536283bf31dd5fb4f61b724e502e68).
[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96804046 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. 
+ * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. + * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** @group setParam */ + @Since("2.2.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + def fit(dataset: Dataset[_]): FPGrowthModel = { +val data = dataset.select($(featuresCol)).rdd.map(r => r.getSeq[String](0).toArray) +val parentModel = new MLlibFPGrowth().setMinSupport($(minSupport)).run(data) +copyValues(new FPGrowthModel(uid, parentModel)) + } + + @Since("2.2.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + override def copy(extra: ParamMap): FPGrowth = 
defaultCopy(extra) +} + + +@Since("2.2.0") +object FPGrowth extends DefaultParamsReadable[FPGrowth] { + + @Since("2.2.0") + override def load(path: String): FPGrowth = super.load(path) +} + +/** + * :: Experimental :: + * Model fitted by FPGrowth. + * + * @param parentModel a model trained by spark.mllib.fpm.FPGrowth + */ +@Since("2.2.0") +@Experimental +class FPGrowthModel private[ml] ( +@Since("2.2.0") override val uid: String, +private val parentModel: MLlibFPGrowthModel[_]) + extends Model[FPGrowthModel] with FPGrowthParams with
[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96803812 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param.{DoubleParam, ParamMap, Params} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util._ +import org.apache.spark.mllib.fpm.{FPGrowth => MLlibFPGrowth, FPGrowthModel => MLlibFPGrowthModel} +import org.apache.spark.sql.{DataFrame, _} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.{ArrayType, StringType, StructType} + +/** + * Common params for FPGrowth and FPGrowthModel + */ +private[fpm] trait FPGrowthParams extends Params with HasFeaturesCol with HasPredictionCol { + + /** + * Validates and transforms the input schema. 
+ * @param schema input schema + * @return output schema + */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.checkColumnType(schema, $(featuresCol), new ArrayType(StringType, false)) +SchemaUtils.appendColumn(schema, $(predictionCol), new ArrayType(StringType, false)) + } + + /** + * the minimal support level of the frequent pattern + * Default: 0.3 + * @group param + */ + @Since("2.2.0") + val minSupport: DoubleParam = new DoubleParam(this, "minSupport", +"the minimal support level of the frequent pattern (Default: 0.3)") + + /** @group getParam */ + @Since("2.2.0") + def getMinSupport: Double = $(minSupport) + +} + +/** + * :: Experimental :: + * A parallel FP-growth algorithm to mine frequent itemsets. + * + * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query + * Recommendation]] + */ +@Since("2.2.0") +@Experimental +class FPGrowth @Since("2.2.0") ( +@Since("2.2.0") override val uid: String) + extends Estimator[FPGrowthModel] with FPGrowthParams with DefaultParamsWritable { + + @Since("2.2.0") + def this() = this(Identifiable.randomUID("FPGrowth")) + + /** @group setParam */ + @Since("2.2.0") + def setMinSupport(value: Double): this.type = set(minSupport, value) + setDefault(minSupport -> 0.3) + + /** @group setParam */ + @Since("2.2.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) --- End diff -- Thanks. Let's collect more feedback about it.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71640/ Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)** for PR 16639 at commit [`9635980`](https://github.com/apache/spark/commit/9635980fca20d18b44fa5085996ad43cbf3f3bb5). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 We need to define a new map output statistic to do this.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf I don't think that would work. Map output statistics only give an approximate number of output bytes; you can't use them to get a correct row count.
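For context, a sketch of what is being discussed. The shape of the existing class is recalled from the Spark 2.x source and may not be exact; the extended class is purely hypothetical, not an existing Spark API.

```scala
// Roughly today's shape (see MapOutputTracker in Spark core): only
// per-partition byte sizes, which are approximate, and rows are
// variable-width, so bytes cannot be converted into an exact row count.
//   class MapOutputStatistics(val shuffleId: Int,
//                             val bytesByPartitionId: Array[Long])

// Hypothetical extension carrying exact per-partition record counts,
// which is what a shuffle-free GlobalLimit would actually need:
class MapOutputRowStatistics(
    val shuffleId: Int,
    val bytesByPartitionId: Array[Long],    // size estimates, as today
    val recordsByPartitionId: Array[Long])  // exact counts, not estimates
```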
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)** for PR 16639 at commit [`b93c37f`](https://github.com/apache/spark/commit/b93c37f1b0dfd3f1293b7a3df8beacc2fec7a33d). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71638/ Test FAILed.
[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96802011 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.fpm + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.param.{DoubleParam, Param, ParamMap, Params} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.fpm.{AssociationRules => MLlibAssociationRules} +import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} + +/** + * :: Experimental :: + * + * Generates association rules from frequent itemsets ("items", "freq"). This method only generates + * association rules which have a single item as the consequent. + */ +@Since("2.1.0") +@Experimental +class AssociationRules(override val uid: String) extends Params { --- End diff -- `freqItemsets` and `rules` do not have a one-to-one mapping, and this would probably violate the principles of a Transformer.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)** for PR 16639 at commit [`9635980`](https://github.com/apache/spark/commit/9635980fca20d18b44fa5085996ad43cbf3f3bb5).
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)** for PR 16639 at commit [`5c28b62`](https://github.com/apache/spark/commit/5c28b62941fc08438e16a92bc70041636cc1dbee). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71637/ Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user squito commented on the issue: https://github.com/apache/spark/pull/16639 cc @kayousterhout @markhamstra @mateiz This isn't just protecting against crazy user code -- I've seen users hit this with Spark SQL (because of https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214), so it seems important to fix. I attempted to write a larger integration test, which reproduced the issue in a "local-cluster" setup, but got stuck. ShuffleBlockFetcherIterator does _some_ fetches on construction, before it's used as an iterator wrapped in user code. So if the failures happen during that initialization, the existing handling works fine. For the error to occur, the failure has to happen inside the call to `shuffleBlockFetcherIterator.next()` when it's called by the user's iterator. I eventually was able to reproduce it with https://github.com/squito/spark/commit/c2d27d10f32edf70e78d849967f7b7bf51495c4e, but it involved hacking internals and didn't seem easy to turn into a test. I settled for a simpler unit test just on `Executor`, but I'm open to more suggestions.
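The failure mode described above can be sketched in plain Scala. All names below are illustrative stand-ins, not Spark's actual classes; the point is only that classifying a failure by exception type breaks as soon as intervening code re-wraps the original cause.

```scala
// Illustrative sketch only -- not Spark's real classes or executor logic.
class FetchFailedException(msg: String) extends Exception(msg)

// User (or SQL-generated) code that intercepts and re-wraps exceptions,
// like the try/catch in FileFormatWriter referenced above.
def userTask(): Unit =
  try {
    throw new FetchFailedException("lost shuffle block")
  } catch {
    case t: Throwable => throw new RuntimeException("task failed", t)
  }

// Executor-side classification based on the exception type alone:
def classify(t: Throwable): String = t match {
  case _: FetchFailedException => "FetchFailed"      // would trigger a stage retry
  case _                       => "ExceptionFailure" // counted as an ordinary task failure
}

val result =
  try { userTask(); "Success" }
  catch { case t: Throwable => classify(t) }
// The fetch failure is misreported, because user code wrapped it.
```

Walking the exception's cause chain is not a reliable fix either, since user code is free to swallow the cause entirely, which is why the PR moves the signal out of the exception.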
[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96800976 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -586,12 +594,12 @@ class SessionCatalog( desc = metadata, output = metadata.schema.toAttributes, child = parser.parsePlan(viewText)) - SubqueryAlias(relationAlias, child, Option(name)) + SubqueryAlias(relationAlias, child, Some(name.copy(table = table, database = Some(db } else { SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), None) } } else { -SubqueryAlias(relationAlias, tempTables(table), Option(name)) +SubqueryAlias(relationAlias, tempTables(table), None) --- End diff -- Sorry, I have been living under a rock for the past month or so. This is not really needed anymore. Let's remove it.
[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16552 **[Test build #71639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71639/testReport)** for PR 16552 at commit [`2bf67c7`](https://github.com/apache/spark/commit/2bf67c722b4d57f446b54fa4f35349ba7cb2b6d6).
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71636/ Test FAILed.
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)** for PR 16639 at commit [`b93c37f`](https://github.com/apache/spark/commit/b93c37f1b0dfd3f1293b7a3df8beacc2fec7a33d).
[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96800456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -586,12 +594,12 @@ class SessionCatalog( desc = metadata, output = metadata.schema.toAttributes, child = parser.parsePlan(viewText)) - SubqueryAlias(relationAlias, child, Option(name)) + SubqueryAlias(relationAlias, child, Some(name.copy(table = table, database = Some(db } else { SubqueryAlias(relationAlias, SimpleCatalogRelation(metadata), None) } } else { -SubqueryAlias(relationAlias, tempTables(table), Option(name)) +SubqueryAlias(relationAlias, tempTables(table), None) --- End diff -- ping @hvanhovell again. : )
[GitHub] spark issue #16581: [SPARK-18589] [SQL] Fix Python UDF accessing attributes ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16581 **[Test build #3541 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3541/testReport)** for PR 16581 at commit [`e4db820`](https://github.com/apache/spark/commit/e4db8209843379fdd385dbf299baca7dea410075).
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)** for PR 16639 at commit [`5c28b62`](https://github.com/apache/spark/commit/5c28b62941fc08438e16a92bc70041636cc1dbee).
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Yes, you are right, we cannot ensure a uniform distribution for the global limit. One idea is to avoid a special partitioner: after the shuffle we can get the map output statistics for the row count of each bucket, and then decide how many elements each global limit should take.
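The sizing idea above can be sketched as a small pure function: given the per-bucket row counts (as map output statistics would report them) and the global limit, assign each bucket its contribution greedily. Names and the greedy strategy are hypothetical, this is not Spark's implementation.

```scala
// Illustrative sketch: decide how many rows each bucket contributes to a
// global limit, given per-bucket row counts from map output statistics.
// Greedy left-to-right assignment; stops once the limit is satisfied.
def rowsPerBucket(bucketCounts: Seq[Long], limit: Long): Seq[Long] = {
  var remaining = limit
  bucketCounts.map { n =>
    val take = math.min(n, remaining)
    remaining -= take
    take
  }
}
```

For example, with bucket counts (5, 1, 0, 0, 0) and a limit of 4, the first bucket alone supplies all 4 rows; with counts (2, 3, 4) and a limit of 6, the buckets supply 2, 3, and 1 rows respectively.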
[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71636/testReport)** for PR 16639 at commit [`0091aba`](https://github.com/apache/spark/commit/0091abacb930642a4ef2178a31be7d6b70462766).
[GitHub] spark pull request #16639: [SPARK-19276][CORE] Fetch Failure handling robust...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/16639 [SPARK-19276][CORE] Fetch Failure handling robust to user error handling ## What changes were proposed in this pull request? Fault tolerance in Spark requires special handling of shuffle fetch failures. The Executor would catch FetchFailedException and send a special message back to the driver. However, intervening user code could intercept that exception and wrap it with something else. This even happens in Spark SQL. So rather than checking the exception directly, we'll store the fetch failure directly in the TaskContext, where users can't touch it. ## How was this patch tested? Added a test case which failed before the fix. Full test suite via Jenkins. You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark SPARK-19276 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16639.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16639 commit 0091abacb930642a4ef2178a31be7d6b70462766 Author: Imran Rashid Date: 2017-01-18T21:55:50Z [SPARK-19276][CORE] Fetch Failure handling robust to user error handling Fault tolerance in Spark requires special handling of shuffle fetch failures. The Executor would catch FetchFailedException and send a special message back to the driver. However, intervening user code could intercept that exception and wrap it with something else. This even happens in Spark SQL. So rather than checking the exception directly, we'll store the fetch failure directly in the TaskContext, where users can't touch it. This includes a test case which failed before the fix.
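The approach the PR describes, recording the fetch failure in a task-scoped context so the executor can detect it no matter what exception ultimately surfaces, can be sketched in simplified form. All class and method names below are illustrative, not Spark's real `TaskContext` API.

```scala
// Illustrative sketch only -- simplified stand-ins for Spark's internals.
class FetchFailedException(msg: String) extends Exception(msg)

// A task-scoped holder that the shuffle layer writes to and the executor
// reads from; user code never sees it and so cannot clear it.
class TaskContextSketch {
  private var fetchFailure: Option[FetchFailedException] = None
  def setFetchFailed(e: FetchFailedException): Unit = fetchFailure = Some(e)
  def fetchFailed: Option[FetchFailedException] = fetchFailure
}

def runTask(ctx: TaskContextSketch): String =
  try {
    // The shuffle layer records the failure before throwing...
    val e = new FetchFailedException("lost shuffle block")
    ctx.setFetchFailed(e)
    try throw e
    catch {
      // ...and user code may still intercept and re-wrap it:
      case t: Throwable => throw new RuntimeException("wrapped", t)
    }
  } catch {
    case _: Throwable =>
      // The executor consults the context, not the exception type:
      if (ctx.fetchFailed.isDefined) "FetchFailed" else "ExceptionFailure"
  }
```

Even though the exception that reaches the executor is a `RuntimeException`, the recorded state still identifies the failure as a fetch failure.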
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/ Test PASSed.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test PASSed.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit [`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71635/testReport)** for PR 16635 at commit [`ea6bd7d`](https://github.com/apache/spark/commit/ea6bd7d00c4d6ef5aea158a3fc8c3bfc5a0c02e4).
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 @cloud-fan Incorporated code review comments
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16605 Okay. But once this issue is finished, I'm planning to take on SPARK-12823 in a similar way. Do you think it's also not worth trying for struct? cc: @cloud-fan
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16605 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71631/ Test PASSed.
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16605 Merged build finished. Test PASSed.
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16605 **[Test build #71631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)** for PR 16605 at commit [`35715a4`](https://github.com/apache/spark/commit/35715a4b6847f56f62038e9bbd77bf4a83250410). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Merged build finished. Test PASSed.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71630/ Test PASSed.
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16605 Well, it would be good if we could support `Array` in `ScalaUDF`, but it's not a big deal, as users can easily do `udf { (seq: Seq[Int]) => val a = seq.toArray; // do anything you like with the array }`. Considering the size of this PR, I don't think it's worth it.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)** for PR 16635 at commit [`71be60f`](https://github.com/apache/spark/commit/71be60f38bbc18e05b90f4f4837dcda6cde2460d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Merged build finished. Test PASSed.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71634/ Test PASSed.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)** for PR 16593 at commit [`14aed85`](https://github.com/apache/spark/commit/14aed85b6b3b083b8a4fdb3a3cab65f1eebc8729). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71632/ Test FAILed.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Merged build finished. Test FAILed.
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)** for PR 16593 at commit [`9270851`](https://github.com/apache/spark/commit/9270851f0b358c30a14f0f63eded25b68b38b102). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16634: [SPARK-16968][SQL][Backport-2.0]Add additional op...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/16634
[GitHub] spark issue #16634: [SPARK-16968][SQL][Backport-2.0]Add additional options i...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16634 Thanks for the review. Merging to 2.0!
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16605 Sure, @maropu. `WrappedArray` is not documented for now. Hi, @gatorsmile and @cloud-fan. Could you review this PR?
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf No. A simple example: suppose there are 5 local limits which produce 1, 2, 1, 1, 1 rows when the limit is 10. If you shuffle to 5 partitions, the distributions for each local limit look like: 1: (1, 0, 0, 0, 0) 2: (1, 1, 0, 0, 0) 3: (1, 0, 0, 0, 0) 4: (1, 0, 0, 0, 0) 5: (1, 0, 0, 0, 0) So the final row counts in the 5 partitions are (5, 1, 0, 0, 0), which is not uniformly distributed. You don't know how many rows each local limit will produce, so how do you know how many partitions to use and how many rows to retrieve from each partition?
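The skew in this example can be checked with a few lines of plain Scala, assuming each local limit scatters its rows across partitions starting from partition 0 (an illustrative partitioner, not Spark's):

```scala
// Illustrative sketch: sum up how many rows land in each shuffle partition
// when every local limit distributes its rows round-robin from partition 0.
def shuffleCounts(localRows: Seq[Int], numPartitions: Int): Seq[Int] = {
  val totals = Array.fill(numPartitions)(0)
  for (rows <- localRows; i <- 0 until rows)
    totals(i % numPartitions) += 1
  totals.toSeq
}
```

With local-limit outputs (1, 2, 1, 1, 1) and 5 partitions this yields (5, 1, 0, 0, 0): partition 0 receives a row from every local limit, while later partitions receive almost none.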
[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16213 @tdas ping
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790387
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
+    sql("CREATE TABLE `_tbl`(i INT) USING parquet")
+    sql("INSERT INTO `_tbl` VALUES (1), (2), (3)")
+    checkAnswer( sql("SELECT * FROM `_tbl`"), Row(1) :: Row(2) :: Row(3) :: Nil)
--- End diff --
It should be fine.
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16605 Oh, yeah. I didn't notice that, and I think it's a good point. IMO `WrappedArray` is used internally via implicit conversions, so users do not use `WrappedArray` directly for UDFs in most cases. Anyway, thanks a lot for your reviews!
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790264
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
+    sql("CREATE TABLE `_tbl`(i INT) USING parquet")
--- End diff --
I will make this change.
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790238
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
--- End diff --
Yes, it is not parquet-only. I think `SPARK-19059: read file based table whose name starts with underscore` is fine.
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790019
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
+    sql("CREATE TABLE `_tbl`(i INT) USING parquet")
+    sql("INSERT INTO `_tbl` VALUES (1), (2), (3)")
+    checkAnswer( sql("SELECT * FROM `_tbl`"), Row(1) :: Row(2) :: Row(3) :: Nil)
--- End diff --
I think we can stop here: create a table whose name starts with an underscore, insert into it, then read it back. That's enough to prove we support tables with underscores.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Refer to the mailing list: > One issue left is how to decide the shuffle partition number. We can have a config for the maximum number of elements each GlobalLimit task should process, then do a factorization to get a number closest to that config. E.g. the config is 2000: if limit = 10000, 10000 = 2000 * 5, we shuffle to 5 partitions; if limit = 18000, 18000 = 2000 * 9, we shuffle to 9 partitions; if the limit is a prime number, we just fall back to a single partition. You mean for the prime number case?
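The factorization heuristic in the quote can be sketched in a few lines of plain Scala (a hypothetical helper, not Spark API; the name `shufflePartitions` and the brute-force divisor search are illustrative assumptions):

```scala
// Pick a shuffle partition count so each GlobalLimit task processes close to
// `maxRowsPerTask` rows: choose the divisor of `limit` nearest the config,
// and fall back to a single partition when no useful divisor exists (e.g. primes).
def shufflePartitions(limit: Int, maxRowsPerTask: Int): Int = {
  require(limit > 0 && maxRowsPerTask > 0)
  val divisors = (1 to limit).filter(limit % _ == 0)
  val rowsPerTask = divisors.minBy(d => math.abs(d - maxRowsPerTask))
  val parts = limit / rowsPerTask
  if (parts == limit && limit > 1) 1 else parts // only divisor is 1: prime, single partition
}
```

With the 2000-row config from the quote, `shufflePartitions(10000, 2000)` yields 5 tasks of 2000 rows each, while a prime limit such as 10007 degenerates to one partition, which is exactly the case scwf asks about.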
[GitHub] spark pull request #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16605#discussion_r96789868
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala ---
@@ -84,7 +86,9 @@ case class ScalaUDF(
     case 1 =>
       val func = function.asInstanceOf[(Any) => Any]
       val child0 = children(0)
-      lazy val converter0 = CatalystTypeConverters.createToScalaConverter(child0.dataType)
+      lazy val converter0 = inputConverters.map {
--- End diff --
Okay, fixed!
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96789821
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
+    sql("CREATE TABLE `_tbl`(i INT) USING parquet")
--- End diff --
use `withTable("tbl", "_tbl") {...}`
[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96789777
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }
 }
+
+  test(
+    "SPARK-19059: Unable to retrieve data from parquet table whose name startswith underscore") {
--- End diff --
This bug is not parquet-only, right? How about `SPARK-19059: read file based table whose name starts with underscore`?
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit [`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679).
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)** for PR 16593 at commit [`14aed85`](https://github.com/apache/spark/commit/14aed85b6b3b083b8a4fdb3a3cab65f1eebc8729).
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf > it use a special partitioner to do this, the partitioner like the row_numer in sql it give each row a uniform partitionid, so in the reduce task, each task handle num of rows very closely. I see @wzhfy wants to use a partitioner to uniformly distribute the rows from each local limit. However, because each local limit can produce a different number of rows, you can't get a truly uniform distribution. So in the global limit operation, you can't know how many partitions you need in order to satisfy the final limit number.
[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96788653
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala ---
@@ -87,8 +101,8 @@ case class CreateHiveTableAsSelectCommand(
       }
     } else {
       try {
-        sparkSession.sessionState.executePlan(InsertIntoTable(
-          metastoreRelation, Map(), query, overwrite = true, ifNotExists = false)).toRdd
+        sparkSession.sessionState.executePlan(InsertIntoTable(metastoreRelation,
--- End diff --
oh, we should be fine here, the table is created with `reorderedOutputQuery.schema`, so there won't be any type difference
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 To be clear, we now have these issues: 1. Local limit computes all partitions; that means it launches many tasks, when a few tasks may actually be enough. 2. Global limit has a single-partition issue: it currently shuffles all the data to one partition, so if the limit number is very big, it causes a performance bottleneck. It would be ideal to combine the global limit and local limit into one stage and avoid the shuffle, but for now I cannot find a good solution (one with no performance regression) that does this without changing Spark core/scheduler. Your solution tries to do that, but as I suggested, there are cases where the performance may be worse. @wzhfy's idea only resolves the single-partition issue: it still shuffles and still runs local limit on all the partitions, but it does not degrade performance in those cases compared with the current code path. > Another issue is, how do you make sure you create a uniform distribution of the result of local limit. Each local limit can produce different number of rows. It uses a special partitioner to do this. The partitioner, like `row_number` in SQL, gives each row a uniform partition id, so in the reduce stage each task handles a very similar number of rows.
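The row_number-style partitioner scwf describes can be modeled in plain Scala (a sketch under the assumption of a global running row index; real local-limit tasks would need per-task offsets, which is part of what this thread is debating):

```scala
// Assign each row a running index and partition by index % numPartitions,
// so reduce tasks receive row counts that differ by at most one.
val numPartitions = 4
val rows = (1 to 11).toSeq
val byPartition = rows.zipWithIndex.groupBy { case (_, idx) => idx % numPartitions }
val sizes = byPartition.values.map(_.size).toSeq.sorted
// partition sizes differ by at most one row
```

This shows why the approach gives each reduce task "num of rows very closely": the modulo assignment spreads consecutive row indices evenly, regardless of how the rows were originally partitioned.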
[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)** for PR 16593 at commit [`9270851`](https://github.com/apache/spark/commit/9270851f0b358c30a14f0f63eded25b68b38b102).
[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96788538
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
@@ -1361,6 +1355,22 @@ class HiveDDLSuite
     }
   }
+
+  test("create hive serde table as select") {
--- End diff --
`create partitioned hive serde table as select`
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/ Test FAILed.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test FAILed.
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)** for PR 16633 at commit [`6ba8b28`](https://github.com/apache/spark/commit/6ba8b284ec8f43a76c9ba54349438e484a097223). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LocalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode with CodegenSupport ` * `case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode `
[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16605 **[Test build #71631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)** for PR 16605 at commit [`35715a4`](https://github.com/apache/spark/commit/35715a4b6847f56f62038e9bbd77bf4a83250410).
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)** for PR 16635 at commit [`71be60f`](https://github.com/apache/spark/commit/71be60f38bbc18e05b90f4f4837dcda6cde2460d).
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71629/ Test FAILed.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)** for PR 16635 at commit [`499711d`](https://github.com/apache/spark/commit/499711d96f5f776baf482e0cbc12cd55f8c9b2c2). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Merged build finished. Test FAILed.
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)** for PR 16635 at commit [`499711d`](https://github.com/apache/spark/commit/499711d96f5f776baf482e0cbc12cd55f8c9b2c2).
[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please
[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16552 The overall idea is to use `InsertIntoTable` to implement appending to a Hive table, but this approach is too hacky. We should follow the way we deal with data source tables: e.g. `DataFrameWriter.saveAsTable` just builds a `CreateTable` plan, the rule `AnalyzeCreateTable` does some checking and normalization, and another rule turns `CreateTable` into `CreateDataSourceTableAsSelectCommand`.
[GitHub] spark pull request #16633: [SPARK-19274][SQL] Make GlobalLimit without shuff...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16633#discussion_r96786080
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -90,21 +94,74 @@ trait BaseLimitExec extends UnaryExecNode with CodegenSupport {
 }

 /**
- * Take the first `limit` elements of each child partition, but do not collect or shuffle them.
+ * Take the first `limit` elements of the child's partitions.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
-
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
-
-  override def outputPartitioning: Partitioning = child.outputPartitioning
-}
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends UnaryExecNode {
+  override def output: Seq[Attribute] = child.output

-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec {
+  protected override def doExecute(): RDD[InternalRow] = {
+    // This logic is mainly copyed from `SparkPlan.executeTake`.
+    // TODO: combine this with `SparkPlan.executeTake`, if possible.
+    val childRDD = child.execute()
+    val totalParts = childRDD.partitions.length
+    var partsScanned = 0
+    var totalNum = 0
+    var resultRDD: RDD[InternalRow] = null
+    while (totalNum < limit && partsScanned < totalParts) {
+      // The number of partitions to try in this iteration. It is ok for this number to be
+      // greater than totalParts because we actually cap it at totalParts in runJob.
+      var numPartsToTry = 1L
+      if (partsScanned > 0) {
+        // If we didn't find any rows after the previous iteration, quadruple and retry.
+        // Otherwise, interpolate the number of partitions we need to try, but overestimate
+        // it by 50%. We also cap the estimation in the end.
+        val limitScaleUpFactor = Math.max(sqlContext.conf.limitScaleUpFactor, 2)
+        if (totalNum == 0) {
+          numPartsToTry = partsScanned * limitScaleUpFactor
+        } else {
+          // the left side of max is >= 1 whenever partsScanned >= 2
+          numPartsToTry = Math.max((1.5 * limit * partsScanned / totalNum).toInt - partsScanned, 1)
+          numPartsToTry = Math.min(numPartsToTry, partsScanned * limitScaleUpFactor)
+        }
+      }

-  override def requiredChildDistribution: List[Distribution] = AllTuples :: Nil
+      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
+      val sc = sqlContext.sparkContext
+      val res = sc.runJob(childRDD,
+        (it: Iterator[InternalRow]) => Array[Int](it.size), p)
+
+      totalNum += res.map(_.head).sum
+      partsScanned += p.size
+
+      if (totalNum >= limit) {
+        // If we scan more rows than the limit number, we need to reduce that from scanned.
+        // We calculate how many rows need to be reduced for each partition,
+        // until all redunant rows are reduced.
+        var numToReduce = (totalNum - limit)
+        val reduceAmounts = new HashMap[Int, Int]()
+        val partitionsToReduce = p.zip(res.map(_.head)).foreach { case (part, size) =>
+          val toReduce = if (size > numToReduce) numToReduce else size
+          reduceAmounts += ((part, toReduce))
+          numToReduce -= toReduce
+        }
+        resultRDD = childRDD.mapPartitionsWithIndexInternal { case (index, iter) =>
+          if (index < partsScanned) {
--- End diff --
Yes. The broken RDD job chain causes the extra partition scan.
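The incremental scan in the diff above follows `SparkPlan.executeTake`; its partition-growth logic can be exercised without Spark (a self-contained model: `partSizes` simulates per-partition row counts, and the scale-up factor defaults to 4, which is only an assumed value for `limitScaleUpFactor`):

```scala
// Model of the executeTake-style loop: count rows in a few partitions, then
// grow the number of partitions tried -- quadrupling after an empty round,
// otherwise interpolating from the rows seen so far -- until `limit` is reached.
def partsNeeded(partSizes: Array[Int], limit: Int, scaleUp: Int = 4): Int = {
  val totalParts = partSizes.length
  var partsScanned = 0
  var totalNum = 0
  while (totalNum < limit && partsScanned < totalParts) {
    var numPartsToTry = 1L
    if (partsScanned > 0) {
      if (totalNum == 0) {
        numPartsToTry = partsScanned.toLong * scaleUp // nothing found yet: scale up
      } else {
        // interpolate, overestimate by 50%, cap by the scale-up factor
        numPartsToTry = math.max((1.5 * limit * partsScanned / totalNum).toInt - partsScanned, 1)
        numPartsToTry = math.min(numPartsToTry, partsScanned.toLong * scaleUp)
      }
    }
    val upTo = math.min(partsScanned + numPartsToTry, totalParts).toInt
    totalNum += partSizes.slice(partsScanned, upTo).sum
    partsScanned = upTo
  }
  partsScanned
}
```

With 3 rows per partition and limit 5, two partitions suffice; leading empty partitions trigger the scale-up path, which is the "quadruple and retry" comment in the diff.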
[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96785980
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -118,6 +118,14 @@ class SessionCatalog(
   }

   /**
+   * A cache of qualified table name to table relation plan.
+   */
+  val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
+    // TODO: create a config instead of hardcode 1000 here.
--- End diff --
Yep. Sure~
[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf The main issue the user reported on the mailing list is that when the limit or the partition count is large enough, shuffling the output of the local limit becomes a performance bottleneck — and @wzhfy's idea also involves shuffling. Another issue is how to guarantee a uniform distribution of the local limit's output: each local limit can produce a different number of rows.
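The non-uniformity @viirya points out is easy to see: each local limit emits `min(partition_size, limit)` rows, so the per-partition output depends entirely on how the input rows happen to be distributed. A small illustrative sketch (not Spark code):

```python
def local_limit_counts(partition_sizes, limit):
    """Rows each partition emits after applying a local LIMIT."""
    return [min(size, limit) for size in partition_sizes]

# Partitions holding 0, 3, and 100 rows under LIMIT 10 emit 0, 3, and 10
# rows respectively, so any follow-up shuffle sees a skewed distribution.
```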
[GitHub] spark issue #16638: spark-19115
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16638 Can one of the admins verify this patch?
[GitHub] spark pull request #16638: spark-19115
GitHub user ouyangxiaochen opened a pull request: https://github.com/apache/spark/pull/16638 spark-19115 ## What changes were proposed in this pull request? Spark SQL supports the command: create external table if not exists gen_tbl like src_tbl location '/warehouse/gen_tbl' in Spark 2.x ## How was this patch tested? manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/ouyangxiaochen/spark spark19115 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16638.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16638 commit adde008588cf8e05cf261c086201c27a8dd5584f Author: ouyangxiaochen Date: 2017-01-19T03:15:17Z spark-19115
[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96785426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -118,6 +118,14 @@ class SessionCatalog(
   }

   /**
+   * A cache of qualified table name to table relation plan.
+   */
+  val tableRelationCache: Cache[QualifiedTableName, LogicalPlan] = {
+    // TODO: create a config instead of hardcoding 1000 here.
--- End diff --
Yeah, it's easy, but I want to minimize the code changes so they are easier to review.
[GitHub] spark pull request #16627: [SPARK-19267][SS]Fix a race condition when stoppi...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16627#discussion_r96779672 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala ---
@@ -34,6 +35,132 @@ import org.apache.spark.util.ThreadUtils

 /** Unique identifier for a [[StateStore]] */
 case class StateStoreId(checkpointLocation: String, operatorId: Long, partitionId: Int)

+/**
+ * The class to maintain [[StateStore]]s. When a SparkContext is active (i.e. SparkEnv.get is not
+ * null), it will run a periodic background task to do maintenance on the loaded stores. The
+ * background task will be cancelled when `stop` is called or `SparkEnv.get` becomes `null`.
+ */
+class StateStoreContext extends Logging {
+  import StateStore._
+
+  private val maintenanceTaskExecutor =
+    ThreadUtils.newDaemonSingleThreadScheduledExecutor("state-store-maintenance-task")
+
+  @GuardedBy("StateStore.LOCK")
+  private val loadedProviders = new mutable.HashMap[StateStoreId, StateStoreProvider]()
+
+  @GuardedBy("StateStore.LOCK")
+  private var _coordRef: StateStoreCoordinatorRef = null
+
+  @GuardedBy("StateStore.LOCK")
+  private var isStopped: Boolean = false
+
+  /** Get the state store provider, or add `stateStoreProvider` if it does not exist */
+  def getOrElseUpdate(
+      storeId: StateStoreId,
+      stateStoreProvider: => StateStoreProvider): StateStoreProvider = LOCK.synchronized {
+    loadedProviders.getOrElseUpdate(storeId, stateStoreProvider)
+  }
+
+  /** Unload a state store provider */
+  def unload(storeId: StateStoreId): Unit = LOCK.synchronized { loadedProviders.remove(storeId) }
+
+  /** Whether a state store provider is loaded or not */
+  def isLoaded(storeId: StateStoreId): Boolean = LOCK.synchronized {
+    loadedProviders.contains(storeId)
+  }
+
+  /** Whether the maintenance task is running */
+  def isMaintenanceRunning: Boolean = LOCK.synchronized { !isStopped }
+
+  /** Unload and stop all state store providers */
+  def stop(): Unit = LOCK.synchronized {
+    if (!isStopped) {
+      isStopped = true
+      loadedProviders.clear()
+      maintenanceTaskExecutor.shutdown()
+      logInfo("StateStore stopped")
+    }
+  }
+
+  /** Start the periodic maintenance task if not already started and if Spark is active */
+  private def startMaintenance(): Unit = {
+    val env = SparkEnv.get
+    if (env != null) {
+      val periodMs = env.conf.getTimeAsMs(
+        MAINTENANCE_INTERVAL_CONFIG, s"${MAINTENANCE_INTERVAL_DEFAULT_SECS}s")
+      val runnable = new Runnable {
+        override def run(): Unit = { doMaintenance() }
+      }
+      maintenanceTaskExecutor.scheduleAtFixedRate(
+        runnable, periodMs, periodMs, TimeUnit.MILLISECONDS)
+      logInfo("State Store maintenance task started")
+    }
+  }
+
+  /**
+   * Execute background maintenance task in all the loaded store providers if they are still
+   * the active instances according to the coordinator.
+   */
+  private def doMaintenance(): Unit = {
+    logDebug("Doing maintenance")
+    if (SparkEnv.get == null) {
+      stop()
+    } else {
+      LOCK.synchronized { loadedProviders.toSeq }.foreach { case (id, provider) =>
--- End diff --
This locks the state store while maintenance is going on. Since it uses the same lock as the external one, a task using the store will block on the maintenance task.
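The pattern under review — take the lock only long enough to snapshot the provider map, then run the slow maintenance over the copy — can be sketched in Python (illustrative only; the class and names are mine, not Spark's):

```python
import threading


class ProviderRegistry:
    """Snapshot-under-lock: copy the shared map while holding the lock,
    then do the slow per-provider maintenance with the lock released, so
    tasks that need the lock are not blocked for the whole pass."""

    def __init__(self):
        self._lock = threading.RLock()
        self._providers = {}

    def register(self, store_id, provider):
        with self._lock:
            self._providers[store_id] = provider

    def do_maintenance(self, maintain):
        with self._lock:
            snapshot = list(self._providers.items())  # cheap copy under lock
        for store_id, provider in snapshot:           # slow work, lock released
            maintain(store_id, provider)
```

Whether this avoids the blocking @tdas describes depends on what the maintenance body does: if `maintain` re-acquires the same lock internally, tasks using the store can still contend with the maintenance pass.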