[GitHub] [spark] AmplabJenkins removed a comment on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
AmplabJenkins removed a comment on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658006936 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
huaxingao commented on a change in pull request #28960: URL: https://github.com/apache/spark/pull/28960#discussion_r454144678

## File path: mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala

## @@ -226,45 +239,48 @@ object GradientDescent extends Logging {

         var converged = false // indicates whether converged based on convergenceTol
         var i = 1
    -    while (!converged && i <= numIterations) {
    -      val bcWeights = data.context.broadcast(weights)
    -      // Sample a subset (fraction miniBatchFraction) of the total data
    -      // compute and sum up the subgradients on this subset (this is one map-reduce)
    -      val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    -        .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    -          seqOp = (c, v) => {
    -            // c: (grad, loss, count), v: (label, features)
    -            val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
    -            (c._1, c._2 + l, c._3 + 1)
    -          },
    -          combOp = (c1, c2) => {
    -            // c: (grad, loss, count)
    -            (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
    -          })
    -      bcWeights.destroy()
    -
    -      if (miniBatchSize > 0) {
    -        /**
    -         * lossSum is computed using the weights from the previous iteration
    -         * and regVal is the regularization value computed in the previous iteration as well.
    -         */
    -        stochasticLossHistory += lossSum / miniBatchSize + regVal
    -        val update = updater.compute(
    -          weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble),
    -          stepSize, i, regParam)
    -        weights = update._1
    -        regVal = update._2
    -
    -        previousWeights = currentWeights
    -        currentWeights = Some(weights)
    -        if (previousWeights != None && currentWeights != None) {
    -          converged = isConverged(previousWeights.get,
    -            currentWeights.get, convergenceTol)
    +    breakable {
    +      while (i <= numIterations + 1) {
    +        val bcWeights = data.context.broadcast(weights)
    +        // Sample a subset (fraction miniBatchFraction) of the total data
    +        // compute and sum up the subgradients on this subset (this is one map-reduce)
    +        val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
    +          .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
    +            seqOp = (c, v) => {

Review comment: Fixed. Thanks!
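For context on the loop being reviewed: it runs mini-batch SGD until either the iteration budget is exhausted or the relative change in the weight vector falls below `convergenceTol`. Below is a minimal, hypothetical Python sketch of that same loop shape. The real implementation is the Scala code in GradientDescent.scala above; the function and parameter names here are illustrative only, and `is_converged` mirrors the relative-difference test against the tolerance.

```python
import math


def is_converged(prev, curr, tol):
    # Relative change of the weight vector: ||curr - prev|| < tol * max(||curr||, 1).
    # This is the general shape of the convergence test discussed in the diff.
    diff = math.sqrt(sum((p - c) ** 2 for p, c in zip(prev, curr)))
    norm = math.sqrt(sum(c * c for c in curr))
    return diff < tol * max(norm, 1.0)


def sgd(gradient, weights, step_size, num_iterations, tol):
    """Gradient-descent loop with a convergence-tolerance early exit.

    `gradient` maps a weight list to a gradient list. Plain Python stands in
    for the distributed treeAggregate in the Scala original.
    """
    i = 1
    converged = False
    while not converged and i <= num_iterations:
        grad = gradient(weights)
        prev = list(weights)
        # Step size decays with 1/sqrt(i), as in Spark's simple updater.
        weights = [w - (step_size / math.sqrt(i)) * g for w, g in zip(weights, grad)]
        converged = is_converged(prev, weights, tol)
        i += 1
    return weights, i - 1
```

The PR's change wraps this loop in `breakable { ... }` so the final iteration can record its loss before exiting, which is why the bound becomes `numIterations + 1` in the new code.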
[GitHub] [spark] SparkQA commented on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
SparkQA commented on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658009295
[GitHub] [spark] SparkQA commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
SparkQA commented on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-658009294

**[Test build #125807 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125807/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
SparkQA commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658009292

**[Test build #125808 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125808/testReport)** for PR 28917 at commit [`ec0d8d0`](https://github.com/apache/spark/commit/ec0d8d00b64662343dc6b3945dc5999343b699a7).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
AmplabJenkins commented on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-658009403
[GitHub] [spark] SparkQA commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
SparkQA commented on pull request #28939: URL: https://github.com/apache/spark/pull/28939#issuecomment-658009291

**[Test build #125803 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125803/testReport)** for PR 28939 at commit [`449df2b`](https://github.com/apache/spark/commit/449df2b92e5ad0dac6ea8dd83233450946a39df2).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28901: [SPARK-32064][SQL] Supporting create temporary table
SparkQA commented on pull request #28901: URL: https://github.com/apache/spark/pull/28901#issuecomment-658009289

**[Test build #125805 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125805/testReport)** for PR 28901 at commit [`9b11aac`](https://github.com/apache/spark/commit/9b11aace28be8169e8eff1ce61810bc8250fc37d).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
AmplabJenkins commented on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658009391
[GitHub] [spark] SparkQA commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658009287

**[Test build #125806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125806/testReport)** for PR 28708 at commit [`5a0cd2a`](https://github.com/apache/spark/commit/5a0cd2abd316aacc601b9e8fa6e1406b67c55fb7).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public final class MapOutputCommitMessage`
  * `case class IsExecutorAlive(executorId: String) extends CoarseGrainedClusterMessage`
  * `sealed trait LogisticRegressionSummary extends ClassificationSummary`
  * `sealed trait RandomForestClassificationSummary extends ClassificationSummary`
  * `class _ClassificationSummary(JavaWrapper):`
  * `class _TrainingSummary(JavaWrapper):`
  * `class _BinaryClassificationSummary(_ClassificationSummary):`
  * `class LinearSVCModel(_JavaClassificationModel, _LinearSVCParams, JavaMLWritable, JavaMLReadable,`
  * `class LinearSVCSummary(_BinaryClassificationSummary):`
  * `class LinearSVCTrainingSummary(LinearSVCSummary, _TrainingSummary):`
  * `class LogisticRegressionSummary(_ClassificationSummary):`
  * `class LogisticRegressionTrainingSummary(LogisticRegressionSummary, _TrainingSummary):`
  * `class BinaryLogisticRegressionSummary(_BinaryClassificationSummary,`
  * `class RandomForestClassificationSummary(_ClassificationSummary):`
  * `class RandomForestClassificationTrainingSummary(RandomForestClassificationSummary,`
  * `class BinaryRandomForestClassificationSummary(_BinaryClassificationSummary):`
  * `class BinaryRandomForestClassificationTrainingSummary(BinaryRandomForestClassificationSummary,`
  * `class DisableHints(conf: SQLConf) extends RemoveAllHints(conf: SQLConf)`
  * `case class WithFields(`
  * `case class Hour(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField`
  * `case class Minute(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField`
  * `case class Second(child: Expression, timeZoneId: Option[String] = None) extends GetTimeField`
  * `trait GetDateField extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant`
  * `case class DayOfYear(child: Expression) extends GetDateField`
  * `case class SecondsToTimestamp(child: Expression) extends UnaryExpression`
  * `case class Year(child: Expression) extends GetDateField`
  * `case class YearOfWeek(child: Expression) extends GetDateField`
  * `case class Quarter(child: Expression) extends GetDateField`
  * `case class Month(child: Expression) extends GetDateField`
  * `case class DayOfMonth(child: Expression) extends GetDateField`
  * `case class DayOfWeek(child: Expression) extends GetDateField`
  * `case class WeekDay(child: Expression) extends GetDateField`
  * `case class WeekOfYear(child: Expression) extends GetDateField`
  * `sealed trait UTCTimestamp extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant`
  * `case class FromUTCTimestamp(left: Expression, right: Expression) extends UTCTimestamp`
  * `case class ToUTCTimestamp(left: Expression, right: Expression) extends UTCTimestamp`
  * `sealed abstract class MergeAction extends Expression with Unevaluable`
  * `case class DeleteAction(condition: Option[Expression]) extends MergeAction`
  * `trait BaseScriptTransformationExec extends UnaryExecNode`
  * `abstract class BaseScriptTransformationWriterThread(`
  * `abstract class BaseScriptTransformIOSchema extends Serializable`
  * `case class CoalesceBucketsInSortMergeJoin(conf: SQLConf) extends Rule[SparkPlan]`
  * `class StateStoreConf(`
  * `case class HiveScriptTransformationExec(`
[GitHub] [spark] SparkQA commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation
SparkQA commented on pull request #29087: URL: https://github.com/apache/spark/pull/29087#issuecomment-658009286

**[Test build #125797 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125797/testReport)** for PR 29087 at commit [`5d85160`](https://github.com/apache/spark/commit/5d85160abca388a53054551ad7ce9e48e363dcd5).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
SparkQA removed a comment on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658003670 **[Test build #125809 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125809/testReport)** for PR 28960 at commit [`0767117`](https://github.com/apache/spark/commit/07671170b7dac6227e4c1a98f58bf24f9be9ad25).
[GitHub] [spark] AmplabJenkins commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658009351
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29086: [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
HyukjinKwon commented on a change in pull request #29086: URL: https://github.com/apache/spark/pull/29086#discussion_r454146710

## File path: dev/run-tests.py

## @@ -589,43 +627,74 @@ def main():

             # /home/jenkins/anaconda2/envs/py36/bin
             os.environ["PATH"] = "/home/anaconda/envs/py36/bin:" + os.environ.get("PATH")
         else:
    -        # else we're running locally and can use local settings
    +        # else we're running locally or Github Actions.
             build_tool = "sbt"
             hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7")
             hive_version = os.environ.get("HIVE_PROFILE", "hive2.3")
    -        test_env = "local"
    +        if "GITHUB_ACTIONS" in os.environ:
    +            test_env = "github_actions"
    +        else:
    +            test_env = "local"

         print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
               "and Hive profile", hive_version, "under environment", test_env)
         extra_profiles = get_hadoop_profiles(hadoop_version) + get_hive_profiles(hive_version)

         changed_modules = None
    +    test_modules = None
         changed_files = None
    -    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
    +    should_only_test_modules = opts.modules is not None
         included_tags = []
    +    excluded_tags = []
         if should_only_test_modules:
    -        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
    +        str_test_modules = [m.strip() for m in opts.modules.split(",")]
             test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
    -        # Directly uses test_modules as changed modules to apply tags and environments
    -        # as if all specified test modules are changed.
    +
    +        # If we're running the tests in Github Actions, attempt to detect and test
    +        # only the affected modules.
    +        if test_env == "github_actions":
    +            if os.environ["GITHUB_BASE_REF"] != "":
    +                # Pull requests
    +                changed_files = identify_changed_files_from_git_commits(
    +                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])

Review comment: This is an example of the merge commit: https://github.com/HyukjinKwon/spark/commit/8f36ec455e19dbfb10195d872a9ccaeb2de8ceca at https://github.com/HyukjinKwon/spark/pull/7
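The environment selection in the diff above follows a simple precedence: the Jenkins branch first, then GitHub Actions, then local. A small, hypothetical Python sketch of that branching is below, with an injectable environment mapping so it can be exercised without touching `os.environ`. The variable names `AMPLAB_JENKINS` and `GITHUB_ACTIONS` follow the script under review; the helper itself and the label strings are illustrative, not Spark's actual API (the real logic in dev/run-tests.py reads more settings than shown).

```python
import os


def detect_test_env(environ=None):
    """Return a test-environment label using the precedence from the diff.

    Jenkins is checked before GitHub Actions so that builds on the AMPLab
    CI keep their existing behavior; anything else falls back to "local".
    """
    env = os.environ if environ is None else environ
    if "AMPLAB_JENKINS" in env:
        return "amplab_jenkins"
    if "GITHUB_ACTIONS" in env:
        return "github_actions"
    return "local"
```

On a pull request, GitHub Actions also exposes `GITHUB_BASE_REF` and `GITHUB_SHA`, which is what the diff uses to diff the merge commit against the target branch and test only the affected modules.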
[GitHub] [spark] SparkQA removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
SparkQA removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658000826 **[Test build #125808 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125808/testReport)** for PR 28917 at commit [`ec0d8d0`](https://github.com/apache/spark/commit/ec0d8d00b64662343dc6b3945dc5999343b699a7).
[GitHub] [spark] SparkQA removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
SparkQA removed a comment on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-657995711 **[Test build #125807 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125807/testReport)** for PR 29088 at commit [`6111a0a`](https://github.com/apache/spark/commit/6111a0a495fc1c0650a472d985ea221f8008f81f).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
AmplabJenkins removed a comment on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658009391 Build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
SparkQA removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-657987750 **[Test build #125806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125806/testReport)** for PR 28708 at commit [`5a0cd2a`](https://github.com/apache/spark/commit/5a0cd2abd316aacc601b9e8fa6e1406b67c55fb7).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
AmplabJenkins removed a comment on pull request #28939: URL: https://github.com/apache/spark/pull/28939#issuecomment-658009729 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658009351 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
SparkQA removed a comment on pull request #28939: URL: https://github.com/apache/spark/pull/28939#issuecomment-657963838 **[Test build #125803 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125803/testReport)** for PR 28939 at commit [`449df2b`](https://github.com/apache/spark/commit/449df2b92e5ad0dac6ea8dd83233450946a39df2).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28901: [SPARK-32064][SQL] Supporting create temporary table
AmplabJenkins removed a comment on pull request #28901: URL: https://github.com/apache/spark/pull/28901#issuecomment-658009628 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
AmplabJenkins commented on pull request #28939: URL: https://github.com/apache/spark/pull/28939#issuecomment-658009729
[GitHub] [spark] AmplabJenkins commented on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins commented on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658009462
[GitHub] [spark] AmplabJenkins commented on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation
AmplabJenkins commented on pull request #29087: URL: https://github.com/apache/spark/pull/29087#issuecomment-658009765
[GitHub] [spark] SparkQA removed a comment on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation
SparkQA removed a comment on pull request #29087: URL: https://github.com/apache/spark/pull/29087#issuecomment-657914322 **[Test build #125797 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125797/testReport)** for PR 29087 at commit [`5d85160`](https://github.com/apache/spark/commit/5d85160abca388a53054551ad7ce9e48e363dcd5).
[GitHub] [spark] AmplabJenkins commented on pull request #28901: [SPARK-32064][SQL] Supporting create temporary table
AmplabJenkins commented on pull request #28901: URL: https://github.com/apache/spark/pull/28901#issuecomment-658009628
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658000525
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28901: [SPARK-32064][SQL] Supporting create temporary table
AmplabJenkins removed a comment on pull request #28901: URL: https://github.com/apache/spark/pull/28901#issuecomment-658009630 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125805/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28939: [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
AmplabJenkins removed a comment on pull request #28939: URL: https://github.com/apache/spark/pull/28939#issuecomment-658009737 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125803/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation
AmplabJenkins removed a comment on pull request #29087: URL: https://github.com/apache/spark/pull/29087#issuecomment-658009765 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
AmplabJenkins removed a comment on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-658009412 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125807/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28708: [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are being shutdown
AmplabJenkins removed a comment on pull request #28708: URL: https://github.com/apache/spark/pull/28708#issuecomment-658009471 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125806/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658009362 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125808/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
AmplabJenkins removed a comment on pull request #29088: URL: https://github.com/apache/spark/pull/29088#issuecomment-658009403 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28901: [SPARK-32064][SQL] Supporting create temporary table
SparkQA removed a comment on pull request #28901: URL: https://github.com/apache/spark/pull/28901#issuecomment-657973941 **[Test build #125805 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125805/testReport)** for PR 28901 at commit [`9b11aac`](https://github.com/apache/spark/commit/9b11aace28be8169e8eff1ce61810bc8250fc37d).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29087: [SPARK-28227][SQL] Support TRANSFORM with aggregation
AmplabJenkins removed a comment on pull request #29087: URL: https://github.com/apache/spark/pull/29087#issuecomment-658009772 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125797/ Test FAILed.
[GitHub] [spark] viirya commented on a change in pull request #29086: [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
viirya commented on a change in pull request #29086: URL: https://github.com/apache/spark/pull/29086#discussion_r454151281

## File path: dev/run-tests.py

```diff
@@ -589,43 +627,74 @@ def main():
         # /home/jenkins/anaconda2/envs/py36/bin
         os.environ["PATH"] = "/home/anaconda/envs/py36/bin:" + os.environ.get("PATH")
     else:
-        # else we're running locally and can use local settings
+        # else we're running locally or Github Actions.
         build_tool = "sbt"
         hadoop_version = os.environ.get("HADOOP_PROFILE", "hadoop2.7")
         hive_version = os.environ.get("HIVE_PROFILE", "hive2.3")
-        test_env = "local"
+        if "GITHUB_ACTIONS" in os.environ:
+            test_env = "github_actions"
+        else:
+            test_env = "local"

     print("[info] Using build tool", build_tool, "with Hadoop profile", hadoop_version,
           "and Hive profile", hive_version, "under environment", test_env)
     extra_profiles = get_hadoop_profiles(hadoop_version) + get_hive_profiles(hive_version)

     changed_modules = None
+    test_modules = None
     changed_files = None
-    should_only_test_modules = "TEST_ONLY_MODULES" in os.environ
+    should_only_test_modules = opts.modules is not None
     included_tags = []
+    excluded_tags = []
     if should_only_test_modules:
-        str_test_modules = [m.strip() for m in os.environ.get("TEST_ONLY_MODULES").split(",")]
+        str_test_modules = [m.strip() for m in opts.modules.split(",")]
         test_modules = [m for m in modules.all_modules if m.name in str_test_modules]
-        # Directly uses test_modules as changed modules to apply tags and environments
-        # as if all specified test modules are changed.
+
+        # If we're running the tests in Github Actions, attempt to detect and test
+        # only the affected modules.
+        if test_env == "github_actions":
+            if os.environ["GITHUB_BASE_REF"] != "":
+                # Pull requests
+                changed_files = identify_changed_files_from_git_commits(
+                    os.environ["GITHUB_SHA"], target_branch=os.environ["GITHUB_BASE_REF"])
```

Review comment: Okay. Thanks for clarifying. Looks good.
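The changed-module detection discussed in the diff above can be sketched in miniature. The mapping below is a hypothetical stand-in, not Spark's real module graph (the actual `dev/run-tests.py` resolves modules and their dependencies from `dev/sparktestsupport/modules.py`); it only illustrates the prefix-matching idea:

```python
# Toy sketch of mapping changed files to test modules, loosely modeled on the
# idea behind dev/run-tests.py's affected-module detection in GitHub Actions.
# Module names and source prefixes here are illustrative assumptions.

MODULE_PREFIXES = {
    "pyspark-sql": ["python/pyspark/sql/"],
    "sql": ["sql/"],
    "core": ["core/"],
}

def modules_for_changed_files(changed_files):
    """Return the set of module names whose source prefixes match any changed file."""
    affected = set()
    for path in changed_files:
        for module, prefixes in MODULE_PREFIXES.items():
            if any(path.startswith(p) for p in prefixes):
                affected.add(module)
    return affected

# Example: a SQL change plus a PySpark SQL change affect two modules.
assert sorted(modules_for_changed_files(
    ["sql/core/src/main/scala/Foo.scala", "python/pyspark/sql/functions.py"]
)) == ["pyspark-sql", "sql"]
```

The real script additionally expands this set transitively (a change in `core` implies retesting modules that depend on it), which a prefix match alone cannot capture.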
[GitHub] [spark] HyukjinKwon opened a new pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
HyukjinKwon opened a new pull request #29096: URL: https://github.com/apache/spark/pull/29096

### What changes were proposed in this pull request?
Seems like Jenkins machines came back to normal. Maybe we should just re-enable the dependency test and Javadoc/Scaladoc build in Jenkins for simplicity. Now, without corner-case exceptions, we can merge if either the Jenkins or the GitHub Actions build passes, without depending on each other for dependency testing or Unidoc.

### Why are the changes needed?
For simplicity.

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
Jenkins will test it here.
[GitHub] [spark] HyukjinKwon commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
HyukjinKwon commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658016315 retest this please
[GitHub] [spark] SparkQA commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
SparkQA commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658018293 **[Test build #125811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125811/testReport)** for PR 29096 at commit [`cc298d6`](https://github.com/apache/spark/commit/cc298d61f45dec1712e350adba4c078ef15841e1).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins removed a comment on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658018964
[GitHub] [spark] SparkQA removed a comment on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
SparkQA removed a comment on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658018293 **[Test build #125811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125811/testReport)** for PR 29096 at commit [`cc298d6`](https://github.com/apache/spark/commit/cc298d61f45dec1712e350adba4c078ef15841e1).
[GitHub] [spark] SparkQA commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
SparkQA commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658019061 **[Test build #125811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125811/testReport)** for PR 29096 at commit [`cc298d6`](https://github.com/apache/spark/commit/cc298d61f45dec1712e350adba4c078ef15841e1).
* This patch **fails build dependency tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658018964
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins removed a comment on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658019090 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125811/ Test FAILed.
[GitHub] [spark] joanjiao2016 commented on pull request #21038: [SPARK-22968][DStream] Throw an exception on partition revoking issue
joanjiao2016 commented on pull request #21038: URL: https://github.com/apache/spark/pull/21038#issuecomment-658020895 @koeninger Hi, for disaster recovery we run two Spark Streaming applications with the same group id on different clusters. The first application fails a few minutes after the second one starts, throwing: java.lang.IllegalStateException: No current assignment for partition xxx
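The failure mode described above follows from Kafka's consumer-group semantics: two consumers sharing one group id belong to the same group, so when the second joins, the coordinator rebalances and revokes partitions from the first, whose next commit or seek on a revoked partition then fails. A toy model (an illustration only, not Kafka's actual coordination protocol) of that interaction:

```python
# Toy model of a consumer-group rebalance, illustrating why two applications
# with the same group id conflict. Names and behavior are simplified; this is
# not the real Kafka protocol.

class ToyGroupCoordinator:
    def __init__(self, partitions):
        self.partitions = partitions
        self.assignment = {}  # member -> list of assigned partitions

    def join(self, member):
        # A new member triggers a rebalance: partitions are redistributed
        # round-robin across all current members.
        members = list(self.assignment) + [member]
        self.assignment = {m: [] for m in members}
        for i, p in enumerate(self.partitions):
            self.assignment[members[i % len(members)]].append(p)

    def commit(self, member, partition):
        if partition not in self.assignment.get(member, []):
            raise RuntimeError(f"No current assignment for partition {partition}")
        return "committed"

coord = ToyGroupCoordinator(["topic-0", "topic-1"])
coord.join("app-cluster-1")  # first application owns both partitions
assert coord.commit("app-cluster-1", "topic-0") == "committed"
coord.join("app-cluster-2")  # second application, same group id: rebalance
# topic-1 has been revoked from the first application; using it now fails,
# analogous to the IllegalStateException reported above.
```

For genuine disaster recovery, giving each application its own group id (so each tracks offsets independently) avoids the rebalance entirely.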
[GitHub] [spark] SparkQA commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
SparkQA commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658021416 **[Test build #125812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125812/testReport)** for PR 29096 at commit [`f86a96f`](https://github.com/apache/spark/commit/f86a96fb483ffa08c0c84859b1b77c710c776e27).
[GitHub] [spark] AmplabJenkins commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658021945
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins removed a comment on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658021945
[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
ulysses-you commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r454162497

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala

```diff
@@ -236,6 +236,45 @@ case class ShowFunctionsCommand(
   }
 }

+
+/**
+ * A command for users to refresh the persistent function.
+ * The syntax of using this command in SQL is:
+ * {{{
+ *    REFRESH FUNCTION functionName
+ * }}}
+ */
+case class RefreshFunctionCommand(
+    databaseName: Option[String],
+    functionName: String)
+  extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) {
+      throw new AnalysisException(s"Cannot refresh builtin function $functionName")
+    }
+    if (catalog.isTemporaryFunction(FunctionIdentifier(functionName, databaseName))) {
+      throw new AnalysisException(s"Cannot refresh temporary function $functionName")
+    }
+
+    val identifier = FunctionIdentifier(
+      functionName, Some(databaseName.getOrElse(catalog.getCurrentDatabase)))
+    // we only refresh the permanent function.
+    if (catalog.isPersistentFunction(identifier)) {
+      // register overwrite function.
+      val func = catalog.getFunctionMetadata(identifier)
+      catalog.registerFunction(func, true)
+    } else {
+      // function does not exist; clear the cached function.
+      catalog.unregisterFunction(identifier, true)
+      throw new NoSuchFunctionException(identifier.database.get, functionName)
```

Review comment: `REFRESH TABLE` doesn't have these side effects; it always checks whether the table exists first. I think it's necessary to both invalidate the cache and throw the exception:
* It would be confusing if we could still use or describe a nonexistent function because we only threw an exception.
* It would also be confusing if we could refresh any function name without an exception because we only cleared the cache.

BTW, the current `REFRESH TABLE` has a minor memory leak in this case:
```
-- client a executes:
create table t1(c1 int);
cache table t1;
-- client b executes:
drop table t1;
create table t1(c1 int, c2 int);
uncache table t1;
-- client a's cached t1 now leaks;
-- the reason is that Spark treats it as a plan cache, but users may think of it as a table cache
```
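The semantics being argued for above (on refresh, re-register the function if it still exists in the metastore; otherwise clear the stale cached entry *and* raise) can be sketched with a toy registry. This is an illustration of the proposed behavior only, not Spark's `SessionCatalog` API:

```python
# Toy sketch of the REFRESH FUNCTION semantics discussed above.
# ToyCatalog and its method names are hypothetical; they only mirror the
# refresh-then-raise behavior proposed in the review thread.

class NoSuchFunctionError(Exception):
    pass

class ToyCatalog:
    def __init__(self):
        self.metastore = {}  # persistent function definitions (shared state)
        self.cache = {}      # this session's loaded (registered) functions

    def refresh_function(self, name):
        if name in self.metastore:
            # Function still exists: overwrite any stale cached entry.
            self.cache[name] = self.metastore[name]
        else:
            # Function was dropped elsewhere: clear the stale cache first,
            # then report that it no longer exists.
            self.cache.pop(name, None)
            raise NoSuchFunctionError(name)

cat = ToyCatalog()
cat.metastore["f"] = "v1"
cat.refresh_function("f")   # cache holds v1
cat.metastore["f"] = "v2"   # definition changed by another session
cat.refresh_function("f")   # cache now holds v2
del cat.metastore["f"]      # function dropped by another session
```

After the final `del`, a refresh both evicts the stale `"f"` from the cache and raises, which addresses both confusions listed above: the function can no longer be used from the cache, and refreshing a nonexistent name is loudly reported.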
[GitHub] [spark] wangyum commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
wangyum commented on a change in pull request #29088: URL: https://github.com/apache/spark/pull/29088#discussion_r454167871

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

```diff
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
```

Review comment: Is it correct? I'm not sure.
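For context on the issue the test above targets: Excel guesses a legacy code page for a CSV unless a UTF-8 byte-order mark (BOM) is present, which is why non-ASCII text appears garbled. A minimal stand-alone demonstration using Python's `utf-8-sig` codec (this only illustrates the underlying BOM mechanism, not the Spark CSV option under review):

```python
# Write a CSV with a UTF-8 BOM so Excel detects the encoding correctly.
# "utf-8-sig" prepends the BOM bytes EF BB BF on write.
import csv
import os
import tempfile

row = ["我爱中文", "나는 한국인을 좋아한다"]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(row)

with open(path, "rb") as f:
    head = f.read(3)
assert head == b"\xef\xbb\xbf"  # BOM present, so Excel reads the file as UTF-8
```

Without the BOM (plain `utf-8`), the bytes are identical except for the missing three-byte prefix, and Excel falls back to guessing the encoding.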
[GitHub] [spark] beliefer commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
beliefer commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658027966 retest this please
[GitHub] [spark] AmplabJenkins commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658028313
[GitHub] [spark] peter-toth commented on pull request #29053: [SPARK-32241][SQL] Remove empty children of union
peter-toth commented on pull request #29053: URL: https://github.com/apache/spark/pull/29053#issuecomment-658028307 Thanks for the review.
[GitHub] [spark] HyukjinKwon commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
HyukjinKwon commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658028639 retest this please
[GitHub] [spark] HyukjinKwon commented on a change in pull request #29088: [SPARK-32289][SQL] Some characters are garbled when opening csv files with Excel
HyukjinKwon commented on a change in pull request #29088: URL: https://github.com/apache/spark/pull/29088#discussion_r454170678

## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

```diff
@@ -2353,6 +2355,53 @@ abstract class CSVSuite extends QueryTest with SharedSparkSession with TestCsvDa
       assert(df.schema.last == StructField("col_mixed_types", StringType, true))
     }
   }
+
+  test("Some characters are garbled when opening csv files with Excel") {
+    // scalastyle:off nonascii
+    val chinese = "我爱中文"
+    val korean = "나는 한국인을 좋아한다"
```

Review comment: Yup!
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
AmplabJenkins removed a comment on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658028313
[GitHub] [spark] dongjoon-hyun commented on pull request #29053: [SPARK-32241][SQL] Remove empty children of union
dongjoon-hyun commented on pull request #29053: URL: https://github.com/apache/spark/pull/29053#issuecomment-658030131 Thank you, all!
[GitHub] [spark] SparkQA commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
SparkQA commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658031266 **[Test build #125813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125813/testReport)** for PR 29096 at commit [`f86a96f`](https://github.com/apache/spark/commit/f86a96fb483ffa08c0c84859b1b77c710c776e27).
[GitHub] [spark] SparkQA commented on pull request #28917: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
SparkQA commented on pull request #28917: URL: https://github.com/apache/spark/pull/28917#issuecomment-658031305 **[Test build #125814 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125814/testReport)** for PR 28917 at commit [`ec0d8d0`](https://github.com/apache/spark/commit/ec0d8d00b64662343dc6b3945dc5999343b699a7).
[GitHub] [spark] AmplabJenkins commented on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins commented on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658031929
[GitHub] [spark] dongjoon-hyun closed pull request #29093: [SPARK-32220][SQL][3.0][FOLLOW-UP]SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
dongjoon-hyun closed pull request #29093: URL: https://github.com/apache/spark/pull/29093
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29096: [WIP][TESTS] Enable test-dependencies.sh and Unidoc test in Jenkins jobs
AmplabJenkins removed a comment on pull request #29096: URL: https://github.com/apache/spark/pull/29096#issuecomment-658031929
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
AmplabJenkins removed a comment on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658009398
[GitHub] [spark] SparkQA removed a comment on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
SparkQA removed a comment on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658009354 **[Test build #125810 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125810/testReport)** for PR 28960 at commit [`9a58603`](https://github.com/apache/spark/commit/9a58603ce88b2c3116f6ce77a5144151cffab4ad).
[GitHub] [spark] SparkQA commented on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
SparkQA commented on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658035984 **[Test build #125810 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125810/testReport)** for PR 28960 at commit [`9a58603`](https://github.com/apache/spark/commit/9a58603ce88b2c3116f6ce77a5144151cffab4ad).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28960: [SPARK-32140][ML][PySpark] Add training summary to FMClassificationModel
AmplabJenkins commented on pull request #28960: URL: https://github.com/apache/spark/pull/28960#issuecomment-658036270
[GitHub] [spark] mayurdb opened a new pull request #29097: Spark 32299
mayurdb opened a new pull request #29097: URL: https://github.com/apache/spark/pull/29097

### What changes were proposed in this pull request?
Change the SortMergeJoin orientation at runtime using adaptive query execution.

### Why are the changes needed?
For a SortMergeJoin of equi-join type, the left and right sides of the join are decided by the user's order. In SMJ, the left side of the join is streamed and the right side is buffered (matching values). Because of this, B SMJ A would perform better than A SMJ B if sizeOf(B) > sizeOf(A). With adaptive query execution, once both ShuffleQueryStages corresponding to the join have completed and neither of them is smaller than the broadcast threshold (so the join will not be converted to a BroadcastHashJoin), the join orientation can be changed at runtime.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Added unit tests
- Ran AdaptiveQueryExecSuite
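The decision described above (once runtime sizes are known and neither side can be broadcast, stream the larger side) can be sketched as a tiny cost rule. The function and threshold below are illustrative assumptions, not Spark's actual AQE rule or configuration:

```python
# Toy decision rule for the SMJ orientation change proposed above.
# BROADCAST_THRESHOLD and choose_join_orientation are hypothetical names;
# Spark's real logic lives in its AQE optimizer rules.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, a commonly used default

def choose_join_orientation(left_size, right_size):
    """Return the (streamed, buffered) sides by label, or 'broadcast'."""
    if min(left_size, right_size) <= BROADCAST_THRESHOLD:
        return "broadcast"        # AQE would prefer a broadcast hash join
    if right_size > left_size:
        return ("right", "left")  # swap: stream the larger right side
    return ("left", "right")      # keep the user's order

# A 100 MB left side joined with a 1 GB right side: stream the 1 GB side.
assert choose_join_orientation(100 << 20, 1 << 30) == ("right", "left")
```

The broadcast check comes first because when one side fits under the threshold, a broadcast join avoids the sort entirely and orientation no longer matters.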
[GitHub] [spark] AmplabJenkins commented on pull request #29097: Spark 32299
AmplabJenkins commented on pull request #29097: URL: https://github.com/apache/spark/pull/29097#issuecomment-658038551 Can one of the admins verify this patch?
[GitHub] [spark] cloud-fan commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
cloud-fan commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r454182771 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala ## @@ -236,6 +236,45 @@ case class ShowFunctionsCommand( } } + +/** + * A command for users to refresh the persistent function. + * The syntax of using this command in SQL is: + * {{{ + *REFRESH FUNCTION functionName + * }}} + */ +case class RefreshFunctionCommand( +databaseName: Option[String], +functionName: String) + extends RunnableCommand { + + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) { + throw new AnalysisException(s"Cannot refresh builtin function $functionName") +} +if (catalog.isTemporaryFunction(FunctionIdentifier(functionName, databaseName))) { + throw new AnalysisException(s"Cannot refresh temporary function $functionName") +} + +val identifier = FunctionIdentifier( + functionName, Some(databaseName.getOrElse(catalog.getCurrentDatabase))) +// we only refresh the permanent function. +if (catalog.isPersistentFunction(identifier)) { + // register overwrite function. + val func = catalog.getFunctionMetadata(identifier) + catalog.registerFunction(func, true) +} else { + // function is not exists, clear cached function. + catalog.unregisterFunction(identifier, true) + throw new NoSuchFunctionException(identifier.database.get, functionName) Review comment: It depends on how you define "function exists". If users can still use this function in SQL queries, why do we throw NoSuchFunctionException in REFRESH FUNCTION?
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29024: [WIP][SPARK-32001][SQL]Create JDBC authentication provider developer API
gaborgsomogyi commented on a change in pull request #29024: URL: https://github.com/apache/spark/pull/29024#discussion_r454183194 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ## @@ -23,12 +23,15 @@ import java.util.{Locale, Properties} import org.apache.commons.io.FilenameUtils import org.apache.spark.SparkFiles +import org.apache.spark.annotation.DeveloperApi import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap /** + * ::DeveloperApi:: * Options for the JDBC data source. */ +@DeveloperApi Review comment: We could pass the 2 params, but then we limit further implementation possibilities, so I would vote for the map. At the moment there is no need for params other than `keytab` and `principal`, but later providers may need further things. It's not a strong opinion, I just don't want to close off later possibilities. If we agree on the way, I'll do the changes.
[GitHub] [spark] AmplabJenkins commented on pull request #29097: Spark 32299
AmplabJenkins commented on pull request #29097: URL: https://github.com/apache/spark/pull/29097#issuecomment-658041588 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29097: Spark 32299
AmplabJenkins removed a comment on pull request #29097: URL: https://github.com/apache/spark/pull/29097#issuecomment-658038551 Can one of the admins verify this patch?
[GitHub] [spark] cloud-fan commented on a change in pull request #28676: [SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
cloud-fan commented on a change in pull request #28676: URL: https://github.com/apache/spark/pull/28676#discussion_r454196515 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala ## @@ -60,6 +62,67 @@ case class BroadcastHashJoinExec( } } + override lazy val outputPartitioning: Partitioning = { +joinType match { + case _: InnerLike => +streamedPlan.outputPartitioning match { + case h: HashPartitioning => expandOutputPartitioning(h) + case c: PartitioningCollection => expandOutputPartitioning(c) + case other => other +} + case _ => streamedPlan.outputPartitioning +} + } + + // An one-to-many mapping from a streamed key to build keys. + private lazy val streamedKeyToBuildKeyMapping = { +val mapping = mutable.Map.empty[Expression, Seq[Expression]] +streamedKeys.zip(buildKeys).foreach { + case (streamedKey, buildKey) => +val key = streamedKey.canonicalized +mapping.get(key) match { + case Some(v) => mapping.put(key, v :+ buildKey) + case None => mapping.put(key, Seq(buildKey)) +} +} +mapping.toMap + } + + // Expands the given partitioning collection recursively. + private def expandOutputPartitioning( + partitioning: PartitioningCollection): PartitioningCollection = { +PartitioningCollection(partitioning.partitionings.flatMap { + case h: HashPartitioning => expandOutputPartitioning(h).partitionings + case c: PartitioningCollection => Seq(expandOutputPartitioning(c)) + case other => Seq(other) +}) + } + + // Expands the given hash partitioning by substituting streamed keys with build keys. + // For example, if the expressions for the given partitioning are Seq("a", "b", "c") + // where the streamed keys are Seq("b", "c") and the build keys are Seq("x", "y"), + // the expanded partitioning will have the following expressions: + // Seq("a", "b", "c"), Seq("a", "b", "y"), Seq("a", "x", "c"), Seq("a", "x", "y"). + // The expanded expressions are returned as PartitioningCollection. 
+ private def expandOutputPartitioning(partitioning: HashPartitioning): PartitioningCollection = { +def generateExprCombinations( +current: Seq[Expression], +accumulated: Seq[Expression]): Seq[Seq[Expression]] = { + if (current.isEmpty) { +Seq(accumulated) + } else { +val buildKeys = streamedKeyToBuildKeyMapping.get(current.head.canonicalized) +generateExprCombinations(current.tail, accumulated :+ current.head) ++ + buildKeys.map { _.flatMap(b => generateExprCombinations(current.tail, accumulated :+ b)) Review comment: shall we add an upper bound to avoid creating a too big `PartitioningCollection`?
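The expansion discussed in the diff above can be sketched in Python. This is an illustrative port of the recursive `generateExprCombinations` logic (expressions modeled as plain strings; not Spark's actual code), showing why the result can blow up combinatorially, which motivates the reviewer's suggestion of an upper bound.

```python
def generate_expr_combinations(current, accumulated, mapping):
    # Recursively emit every key combination, substituting each streamed
    # key with each of its mapped build keys (one-to-many mapping).
    if not current:
        return [accumulated]
    head, tail = current[0], current[1:]
    combos = generate_expr_combinations(tail, accumulated + [head], mapping)
    for build_key in mapping.get(head, []):
        combos += generate_expr_combinations(tail, accumulated + [build_key], mapping)
    return combos
```

With partitioning expressions `["a", "b", "c"]`, streamed keys `b`, `c` mapped to build keys `x`, `y`, this yields the four combinations from the code comment above; each mapped key doubles the count, so n mapped keys produce 2^n partitionings.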
[GitHub] [spark] cloud-fan commented on a change in pull request #28676: [SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
cloud-fan commented on a change in pull request #28676: URL: https://github.com/apache/spark/pull/28676#discussion_r454196797 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala ## @@ -60,6 +62,67 @@ case class BroadcastHashJoinExec( } } + override lazy val outputPartitioning: Partitioning = { +joinType match { + case _: InnerLike => +streamedPlan.outputPartitioning match { + case h: HashPartitioning => expandOutputPartitioning(h) + case c: PartitioningCollection => expandOutputPartitioning(c) + case other => other +} + case _ => streamedPlan.outputPartitioning +} + } + + // An one-to-many mapping from a streamed key to build keys. + private lazy val streamedKeyToBuildKeyMapping = { +val mapping = mutable.Map.empty[Expression, Seq[Expression]] +streamedKeys.zip(buildKeys).foreach { + case (streamedKey, buildKey) => +val key = streamedKey.canonicalized +mapping.get(key) match { + case Some(v) => mapping.put(key, v :+ buildKey) + case None => mapping.put(key, Seq(buildKey)) +} +} +mapping.toMap + } + + // Expands the given partitioning collection recursively. + private def expandOutputPartitioning( + partitioning: PartitioningCollection): PartitioningCollection = { +PartitioningCollection(partitioning.partitionings.flatMap { + case h: HashPartitioning => expandOutputPartitioning(h).partitionings + case c: PartitioningCollection => Seq(expandOutputPartitioning(c)) + case other => Seq(other) +}) + } + + // Expands the given hash partitioning by substituting streamed keys with build keys. + // For example, if the expressions for the given partitioning are Seq("a", "b", "c") + // where the streamed keys are Seq("b", "c") and the build keys are Seq("x", "y"), + // the expanded partitioning will have the following expressions: + // Seq("a", "b", "c"), Seq("a", "b", "y"), Seq("a", "x", "c"), Seq("a", "x", "y"). + // The expanded expressions are returned as PartitioningCollection. 
+ private def expandOutputPartitioning(partitioning: HashPartitioning): PartitioningCollection = { +def generateExprCombinations( +current: Seq[Expression], +accumulated: Seq[Expression]): Seq[Seq[Expression]] = { + if (current.isEmpty) { +Seq(accumulated) + } else { +val buildKeys = streamedKeyToBuildKeyMapping.get(current.head.canonicalized) +generateExprCombinations(current.tail, accumulated :+ current.head) ++ + buildKeys.map { _.flatMap(b => generateExprCombinations(current.tail, accumulated :+ b)) Review comment: Or we can create a special `HashPartitioning` which works like a lazy `PartitioningCollection`
[GitHub] [spark] HyukjinKwon opened a new pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
HyukjinKwon opened a new pull request #29098: URL: https://github.com/apache/spark/pull/29098 ### What changes were proposed in this pull request? This PR proposes to simply bypass the case where the array size is negative when collecting data from a Spark DataFrame with no partitions for `toPandas`. ```python spark.sparkContext.emptyRDD().toDF("col1 int").toPandas() ``` In the master and branch-3.0, this was fixed together at https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af but the fix was never ported back to branch-2.4. ### Why are the changes needed? To make an empty Spark DataFrame convertible to a pandas DataFrame. ### Does this PR introduce _any_ user-facing change? Yes, ```python spark.sparkContext.emptyRDD().toDF("col1 int").toPandas() ``` **Before:** ``` ... Caused by: java.lang.NegativeArraySizeException at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293) at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287) at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80) ... ``` **After:** ``` Empty DataFrame Columns: [col1] Index: [] ``` ### How was this patch tested? Manually tested and unit tests were added.
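The shape of the bypass described in the PR above can be sketched generically. This is a hypothetical illustration (names and structure are not Spark's): allocating an array with a negative size is what raises `NegativeArraySizeException` on the JVM side, so the guard short-circuits to an empty result instead of allocating.

```python
def collect_partition_results(num_partitions):
    # Hypothetical sketch of the bypass: an empty plan can report a
    # non-positive partition count, and naively allocating a result array
    # of that size fails. Short-circuit to an empty result instead.
    if num_partitions <= 0:
        return []
    return [None] * num_partitions
```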
[GitHub] [spark] SparkQA commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
SparkQA commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658053443 **[Test build #125815 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125815/testReport)** for PR 29098 at commit [`c3a7f7e`](https://github.com/apache/spark/commit/c3a7f7ea780799541bba869f65fd0fa275b84974).
[GitHub] [spark] AmplabJenkins commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658054104
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins removed a comment on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658054104
[GitHub] [spark] cloud-fan commented on a change in pull request #29045: [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
cloud-fan commented on a change in pull request #29045: URL: https://github.com/apache/spark/pull/29045#discussion_r454201305 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ## @@ -116,47 +116,53 @@ object OrcUtils extends Logging { } /** - * Returns the requested column ids from the given ORC file. Column id can be -1, which means the - * requested column doesn't exist in the ORC file. Returns None if the given ORC file is empty. + * @return Returns the requested column ids from the given ORC file and Boolean flag to use actual + * schema or result schema. Column id can be -1, which means the requested column doesn't + * exist in the ORC file. Returns None if the given ORC file is empty. */ def requestedColumnIds( isCaseSensitive: Boolean, dataSchema: StructType, requiredSchema: StructType, reader: Reader, - conf: Configuration): Option[Array[Int]] = { + conf: Configuration): (Option[Array[Int]], Boolean) = { +var sendActualSchema = false val orcFieldNames = reader.getSchema.getFieldNames.asScala Review comment: Please correct me if I'm wrong: 1. the physical orc file schema is `_col0`, ... 2. the table schema in metastore is `d_date_sk`, ... 3. the query requires only `d_year` I don't know why the query fails. The `requestedColumnIds` will be `[6]` and the orc reader will read the `_col6` column. Everything should be fine.
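The resolution the reviewer walks through can be sketched as follows. This is an illustrative Python sketch, not `OrcUtils.requestedColumnIds` itself (the real method takes schemas and an ORC `Reader`): when a Hive-written ORC file carries positional field names (`_col0`, `_col1`, ...), a required column absent from the file falls back to its position in the metastore table schema, and a truly unknown column maps to -1.

```python
def requested_column_ids(table_schema, orc_field_names, required):
    # Illustrative sketch of column-id resolution (hypothetical names/signature).
    ids = []
    for name in required:
        if name in orc_field_names:
            # File was written with real field names: match directly.
            ids.append(orc_field_names.index(name))
        elif name in table_schema:
            # Positional fallback for _colN-style files, e.g. d_year -> _col6.
            ids.append(table_schema.index(name))
        else:
            ids.append(-1)  # column does not exist at all
    return ids
```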
[GitHub] [spark] cloud-fan commented on pull request #29085: [SPARK-32106][SQL]Implement SparkScriptTransformationExec in sql/core
cloud-fan commented on pull request #29085: URL: https://github.com/apache/spark/pull/29085#issuecomment-658057950 Can we use `Cast` to turn the catalyst value to a string and pass it to the script?
[GitHub] [spark] SparkQA commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
SparkQA commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658057520 **[Test build #125816 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125816/testReport)** for PR 29098 at commit [`8074075`](https://github.com/apache/spark/commit/80740755c822715e8e8956517ee4ecb73c962348).
[GitHub] [spark] AmplabJenkins commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658058126
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins removed a comment on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658058126
[GitHub] [spark] ulysses-you commented on a change in pull request #28840: [SPARK-31999][SQL] Add REFRESH FUNCTION command
ulysses-you commented on a change in pull request #28840: URL: https://github.com/apache/spark/pull/28840#discussion_r454204568 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala ## @@ -236,6 +236,45 @@ case class ShowFunctionsCommand( } } + +/** + * A command for users to refresh the persistent function. + * The syntax of using this command in SQL is: + * {{{ + *REFRESH FUNCTION functionName + * }}} + */ +case class RefreshFunctionCommand( +databaseName: Option[String], +functionName: String) + extends RunnableCommand { + + override def run(sparkSession: SparkSession): Seq[Row] = { +val catalog = sparkSession.sessionState.catalog +if (FunctionRegistry.builtin.functionExists(FunctionIdentifier(functionName))) { + throw new AnalysisException(s"Cannot refresh builtin function $functionName") +} +if (catalog.isTemporaryFunction(FunctionIdentifier(functionName, databaseName))) { + throw new AnalysisException(s"Cannot refresh temporary function $functionName") +} + +val identifier = FunctionIdentifier( + functionName, Some(databaseName.getOrElse(catalog.getCurrentDatabase))) +// we only refresh the permanent function. +if (catalog.isPersistentFunction(identifier)) { + // register overwrite function. + val func = catalog.getFunctionMetadata(identifier) + catalog.registerFunction(func, true) +} else { + // function is not exists, clear cached function. + catalog.unregisterFunction(identifier, true) + throw new NoSuchFunctionException(identifier.database.get, functionName) Review comment: how about this ``` if (catalog.isPersistentFunction(identifier)) { // register overwrite function. val func = catalog.getFunctionMetadata(identifier) catalog.registerFunction(func, true) } else if (catalog.isRegisteredFunction(identifier)) { // clear cached function. 
catalog.unregisterFunction(identifier, true) } else { throw new NoSuchFunctionException(identifier.database.get, functionName) } ```
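The three-way branching ulysses-you suggests above can be sketched in Python. The catalog API here is a hypothetical stand-in for Spark's `SessionCatalog` (method names and signatures are assumptions), meant only to show the control flow: refresh persistent functions, clear stale registered ones, and raise only when the function is known nowhere.

```python
def refresh_function(catalog, identifier):
    # Sketch of the suggested branching (illustrative catalog API, not Spark's).
    if catalog.is_persistent_function(identifier):
        # Re-register the persistent function, overwriting the cached copy.
        func = catalog.get_function_metadata(identifier)
        catalog.register_function(func, overwrite=True)
    elif catalog.is_registered_function(identifier):
        # Function was dropped externally: clear the stale cached entry.
        catalog.unregister_function(identifier)
    else:
        raise LookupError("no such function: %s" % identifier)
```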
[GitHub] [spark] HyukjinKwon opened a new pull request #29099: [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame
HyukjinKwon opened a new pull request #29099: URL: https://github.com/apache/spark/pull/29099 ### What changes were proposed in this pull request? This PR proposes to port the test case from https://github.com/apache/spark/pull/29098 to branch-3.0 and master. In the master and branch-3.0, this was fixed together at https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af but the no-partition case is not being tested. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? The unit test was forward-ported.
[GitHub] [spark] SparkQA commented on pull request #29099: [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame
SparkQA commented on pull request #29099: URL: https://github.com/apache/spark/pull/29099#issuecomment-658061150 **[Test build #125817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125817/testReport)** for PR 29099 at commit [`e986c65`](https://github.com/apache/spark/commit/e986c65f4e968bf58d16569055eda13414f5ec33).
[GitHub] [spark] AmplabJenkins commented on pull request #29099: [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame
AmplabJenkins commented on pull request #29099: URL: https://github.com/apache/spark/pull/29099#issuecomment-658061766
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins removed a comment on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658061866
[GitHub] [spark] AmplabJenkins commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
AmplabJenkins commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658061866
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29099: [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame
AmplabJenkins removed a comment on pull request #29099: URL: https://github.com/apache/spark/pull/29099#issuecomment-658061766
[GitHub] [spark] cloud-fan commented on pull request #29064: [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE
cloud-fan commented on pull request #29064: URL: https://github.com/apache/spark/pull/29064#issuecomment-658062375 We should also add a documentation page for it in the SQL reference.
[GitHub] [spark] adjordan commented on pull request #29080: [SPARK-32271][ML] Update CrossValidator to train folds in parallel
adjordan commented on pull request #29080: URL: https://github.com/apache/spark/pull/29080#issuecomment-658063483 This is ready for review!
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29078: [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13
dongjoon-hyun commented on a change in pull request #29078: URL: https://github.com/apache/spark/pull/29078#discussion_r454210666 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetPrimitiveSuite.scala ## @@ -223,16 +223,6 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSparkSession { checkDataset(Seq(Queue(true)).toDS(), Queue(true)) checkDataset(Seq(Queue("test")).toDS(), Queue("test")) checkDataset(Seq(Queue(Tuple1(1))).toDS(), Queue(Tuple1(1))) - -checkDataset(Seq(ArrayBuffer(1)).toDS(), ArrayBuffer(1)) Review comment: Although this means the removal of test coverage in Scala 2.12, I'm +1 for now. We can add it back later after we have finished everything in Scala 2.13.
[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29024: [WIP][SPARK-32001][SQL]Create JDBC authentication provider developer API
gaborgsomogyi commented on a change in pull request #29024: URL: https://github.com/apache/spark/pull/29024#discussion_r454212309 ## File path: core/src/main/scala/org/apache/spark/security/SecurityConfigurationLock.scala ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.security + +import org.apache.spark.annotation.DeveloperApi + +/** + * ::DeveloperApi:: + * There are cases when global JVM security configuration must be modified. + * In order to avoid race the modification must be synchronized with this. + */ +@DeveloperApi +object SecurityConfigurationLock Review comment: `Considering this I would not add synchronization into the framework` = Adding synchronization into a central place (like `ConnectionProvider.create`) and allowing 3rd-party developers not to care about this is not something where I see the gain (I see cases where such change would do unnecessary synchronization). That said physically it's not an issue but could be misleading. 
An example:

```
def create(driver: Driver, options: JDBCOptions): Connection = {
  val filteredProviders = providers.filter(_.canHandle(driver, options))
  logDebug(s"Filtered providers: $filteredProviders")
  require(filteredProviders.size == 1,
    "JDBC connection initiated but not exactly one connection provider found which can handle it")
  var conn: Connection = null
  // This would synchronize but for nothing in some cases
  SecurityConfigurationLock.synchronized {
    conn = filteredProviders.head.getConnection(driver, options)
  }
  conn
}
```

An imaginary provider implemented by a 3rd party:

```
class OracleConnectionProviderTGT {
  override def canHandle(driver: Driver, options: JDBCOptions): Boolean = {
    // Example content of tgtCache: "/tmp/krb5cc_5088"
    options.tgtCache != null
    ...
  }

  override def getConnection(driver: Driver, options: JDBCOptions): Connection = {
    ...
    // No need to modify global JVM configuration
    prop.setProperty(OracleConnection.CONNECTION_PROPERTY_THIN_NET_AUTHENTICATION_KRB5_CC_NAME,
      options.tgtCache)
    ...
    driver.connect(url, prop)
  }
}
```

Overall, if we would like to add such a change, then I would document that `getConnection` is synchronized under all circumstances, which may or may not be needed. This is not what I would suggest, but I have no strong opinion.
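The trade-off discussed above — always taking `SecurityConfigurationLock` versus locking only when a provider actually mutates global JVM security state — can be sketched in plain Java. All names below (`needsGlobalSecurityConfig`, `ProviderRegistry`) are illustrative assumptions for this sketch, not Spark's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical provider contract: providers declare whether they must
// mutate global JVM security configuration, so the caller can decide
// whether synchronization is actually required.
interface ConnectionProvider {
    boolean canHandle(String url);
    boolean needsGlobalSecurityConfig();
    String getConnection(String url);
}

// Stand-in for Spark's SecurityConfigurationLock singleton.
final class SecurityConfigurationLock {
    static final Object LOCK = new Object();
}

public class ProviderRegistry {
    private final List<ConnectionProvider> providers = new ArrayList<>();

    void register(ConnectionProvider p) {
        providers.add(p);
    }

    String create(String url) {
        // Mirror the "exactly one provider can handle it" requirement.
        ConnectionProvider chosen = null;
        for (ConnectionProvider p : providers) {
            if (p.canHandle(url)) {
                if (chosen != null) {
                    throw new IllegalStateException("more than one provider can handle " + url);
                }
                chosen = p;
            }
        }
        if (chosen == null) {
            throw new IllegalStateException("no provider can handle " + url);
        }
        // Lock only when the provider says it touches global state; a
        // TGT-cache style provider skips the lock entirely.
        if (chosen.needsGlobalSecurityConfig()) {
            synchronized (SecurityConfigurationLock.LOCK) {
                return chosen.getConnection(url);
            }
        }
        return chosen.getConnection(url);
    }
}
```

This keeps the lock out of the hot path for providers that do not modify global configuration, at the cost of a slightly wider provider interface — which is essentially the design question debated in the comment.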
[GitHub] [spark] SparkQA commented on pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
SparkQA commented on pull request #29098: URL: https://github.com/apache/spark/pull/29098#issuecomment-658065020 **[Test build #125818 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125818/testReport)** for PR 29098 at commit [`070ea46`](https://github.com/apache/spark/commit/070ea46dcfb6521d43f107e509fbb5dd520ec9c8).
[GitHub] [spark] zhengruifeng commented on pull request #29018: [SPARK-32202][ML][WIP] tree models auto infer compact integer type
zhengruifeng commented on pull request #29018: URL: https://github.com/apache/spark/pull/29018#issuecomment-658066066 @viirya Thanks for reviewing!

> This win only happens when maxBins is less

Yes, but in most cases, maxBins (default=32) < 128.

> the perf regression happens for all cases

Yes, I think so.

> I'm also not sure how often memory is an issue when training the model

It will make sense if there is not enough memory for the original treePoint (`Array[Int]`). I personally think it may be worthwhile if the regression is small enough, but I am not sure whether the current performance results are OK.
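The memory win being weighed here — storing binned feature indices in a narrower integer type when `maxBins` is small — can be sketched as a simple width-selection rule. The class and method names below are illustrative assumptions for this sketch, not Spark's actual tree-training code:

```java
// Sketch: pick the narrowest signed integer type that can hold every
// possible bin index (0 .. maxBins - 1). With the default maxBins = 32,
// each binned feature fits in one byte instead of four, roughly a 4x
// reduction in per-row storage for the binned representation.
public class CompactBins {

    // Bytes needed per bin index for a given maxBins setting.
    static int bytesPerBin(int maxBins) {
        if (maxBins <= 128) {
            return 1;   // indices 0..127 fit in a signed byte
        }
        if (maxBins <= 32768) {
            return 2;   // fits in a signed short
        }
        return 4;       // fall back to int
    }

    // Approximate storage for one row's binned features.
    static long rowBytes(int numFeatures, int maxBins) {
        return (long) numFeatures * bytesPerBin(maxBins);
    }
}
```

This also makes the trade-off in the thread concrete: the saving only materializes when `maxBins <= 128` (the common case), while the extra widening/narrowing work on read is paid in all cases.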
[GitHub] [spark] HyukjinKwon opened a new pull request #29100: [MINOR][R] Match collectAsArrowToR with non-streaming collectAsArrowToPython
HyukjinKwon opened a new pull request #29100: URL: https://github.com/apache/spark/pull/29100

### What changes were proposed in this pull request?
This PR proposes to port forward #29098 to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512; SparkR vectorization currently cannot use the streaming format. Note that, if I am not mistaken, you cannot create a Spark DataFrame with no partitions in SparkR, so there are no behaviour changes for end users.

### Why are the changes needed?
For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The same code is being tested in `collectAsArrowToPython` of branch-2.4.
[GitHub] [spark] gaborgsomogyi commented on pull request #29024: [WIP][SPARK-32001][SQL]Create JDBC authentication provider developer API
gaborgsomogyi commented on pull request #29024: URL: https://github.com/apache/spark/pull/29024#issuecomment-658066311 retest this please