[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...
Github user smurakozi commented on a diff in the pull request: https://github.com/apache/spark/pull/19920#discussion_r155718836

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala ---
@@ -88,4 +90,8 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
   private def makePageLink(showIncomplete: Boolean): String = {
     UIUtils.prependBaseUri("/?" + "showIncomplete=" + showIncomplete)
   }
+
+  private def isApplicationCompleted(appInfo: ApplicationInfo): Boolean = {
--- End diff --

My first commit had this method in `ApplicationInfo`, called `completed`: `+case class ApplicationInfo private[spark]( ... +def completed: Boolean = {`

@vanzin asked: "Is this used anywhere?" I answered that we use it in `HistoryPage` (I forgot to mention its usage in the test). @vanzin then asked: "Can we move this logic there instead? That avoids adding a new visible property in a public object." I moved it, noting the code duplication in a comment on the PR.

The question is which is better: a bit of duplication (used 3 times and defined 2 times, once in application code and once in test code) or extending the public API? My gut feeling is that the small duplication is preferable in this case, but I'm a newbie in Spark development, so I'm easy to convince if we have different preferences :)

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19894: [SPARK-22700][ML] Bucketizer.transform incorrectly drops...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19894 LGTM. Good fix.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19904

@sethah To verify the memory issue, you can add one line of test code against the current master here:

```
val modelFutures = ...

// Unpersist training data only when all models have trained
Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext)
  .onComplete { _ => trainingDataset.unpersist() } (executionContext)

// Evaluate models in a Future that will calculate a metric and allow model to be cleaned up
val foldMetricFutures = ...

// Wait for metrics to be calculated before unpersisting validation dataset
val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
validationDataset.unpersist()

// add test code here, fetch all models
val models = modelFutures.map(_.value.get.get)
foldMetrics
```

The test code added here is **`val models = modelFutures.map(_.value.get.get)`**. It proves that these models are still in memory: we can still get them.
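WeichenXu's point, that completed futures keep their results reachable, can be illustrated outside Spark: as long as the future objects themselves are referenced, the values they completed with cannot be garbage-collected. A minimal Python sketch of the same idea (`concurrent.futures` standing in for Scala futures; the `train` function and its "model" dict are made-up placeholders, not Spark APIs):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def train(param):
    # Stand-in for est.fit(...): returns a sizable "model" object.
    return {"param": param, "weights": [0.0] * 4}

with ThreadPoolExecutor(max_workers=2) as pool:
    model_futures = [pool.submit(train, p) for p in (0.1, 0.2, 0.3)]
    wait(model_futures)

# Analogue of modelFutures.map(_.value.get.get): the completed futures still
# hold strong references to every model, so all of them remain in memory.
models = [f.result() for f in model_futures]
print(len(models))
```

Dropping the `model_futures` list itself would be the only way to let the models be reclaimed, which is the memory concern under discussion.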
[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19904#discussion_r155715665

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,25 +147,18 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
     logDebug(s"Train split $splitIndex with multiple sets of parameters.")
+    val completeFitCount = new AtomicInteger(0)
--- End diff --

We hope to unpersist the training dataset once all fitting is finished, but your idea here postpones the unpersist until all fitting & evaluation are done. Your code should have the same effect as:

```
val modelFutures = ...
val foldMetrics = modelFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
trainingDataset.unpersist()
validationDataset.unpersist()
```

About what you said:

> "Now, the unpersist operation will happen in one of the training threads, instead of asynchronously in its own thread."

What's the possible effect or impact? `trainingDataset.unpersist()` is itself an async method and won't block, so will it have any effect? I think it can safely be put in any thread.
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19852 **[Test build #84647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84647/testReport)** for PR 19852 at commit [`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19852 retest this please
[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19921 thanks, merging to master!
[GitHub] spark pull request #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2O...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19921
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84641/ Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813

**[Test build #84641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84641/testReport)** for PR 19813 at commit [`7fc3e04`](https://github.com/apache/spark/commit/7fc3e04b9894214e67227beeb14165610e4887d2).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ExprInputVar(exprCode: ExprCode, dataType: DataType, nullable: Boolean)`
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84642/ Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813

**[Test build #84642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84642/testReport)** for PR 19813 at commit [`9443011`](https://github.com/apache/spark/commit/9443011978c32c611e950a6193f05aa666437f50).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class ExprInputVar(exprCode: ExprCode, dataType: DataType, nullable: Boolean)`
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155711631

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile ---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

@jason-dai Can you follow up on https://github.com/apache-spark-on-k8s/spark/issues/326 directly?
[GitHub] spark issue #19621: [SPARK-11215][ML] Add multiple columns support to String...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19621

Maybe we could also change the test itself to make it more deterministic? We could first create a new test dataset that avoids tied frequency values, run it through the original implementation, then run it through this new change to make sure they match. @actuaryzhang do you have any thoughts on this?
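The determinism concern can be pictured with a small sketch (plain Python, not Spark's StringIndexer code): when two labels tie on frequency, ordering by frequency alone leaves their relative order to the sort's tie-breaking, while a secondary key pins down the expected order a test can assert on.

```python
from collections import Counter

data = ["a", "b", "a", "b", "c"]
freq = Counter(data)

# "a" and "b" tie at frequency 2; ordering by frequency alone leaves their
# relative order to the sort implementation, which is what can make a test flaky.
by_freq_only = sorted(freq, key=lambda label: -freq[label])

# Adding the label itself as a secondary key makes the expected order unambiguous.
deterministic = sorted(freq, key=lambda label: (-freq[label], label))
print(deterministic)  # ['a', 'b', 'c']
```

A dataset with no ties, as suggested above, removes the ambiguity without needing the secondary key.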
[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19904#discussion_r155710913

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -146,25 +147,18 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
     val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
     logDebug(s"Train split $splitIndex with multiple sets of parameters.")
+    val completeFitCount = new AtomicInteger(0)
--- End diff --

My understanding of Scala futures may be off here, but this seems to change the behavior to me. Now, the unpersist operation will happen in one of the training threads, instead of asynchronously in its own thread. I'm not sure how much of an effect that will have. Why can't you just put all the logic in one map statement like below:

```scala
// Fit models in a Future for training in parallel
val modelFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
  Future[Model[_]] {
    val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
    if (collectSubModelsParam) {
      subModels.get(splitIndex)(paramIndex) = model
    }
    // TODO: duplicate evaluator to take extra params from input
    val metric = eval.evaluate(model.transform(validationDataset, paramMap))
    logDebug(s"Got metric $metric for model trained with $paramMap.")
    metric
  } (executionContext)
}

// Unpersist training data only when all models have trained
Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext)
  .onComplete { _ => trainingDataset.unpersist() } (executionContext)
```
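The pattern the patch introduces, an atomic counter where the last-finishing training thread performs the cleanup, can be sketched generically in Python (`concurrent.futures` and a lock standing in for Scala futures and `AtomicInteger`; the "unpersist" is recorded in a list rather than calling any Spark API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

n_tasks = 3
complete_fit_count = [0]      # stand-in for the AtomicInteger in the patch
count_lock = threading.Lock()
unpersist_calls = []

def fit(param):
    model = param * 2         # stand-in for est.fit(...)
    with count_lock:
        complete_fit_count[0] += 1
        if complete_fit_count[0] == n_tasks:
            # The last-finishing training thread performs the cleanup,
            # which is the behavior change the review comment is about.
            unpersist_calls.append("trainingDataset.unpersist()")
    return model

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fit, p) for p in (1, 2, 3)]
    wait(futures)

print(unpersist_calls)  # ['trainingDataset.unpersist()']
```

Either placement runs the cleanup exactly once after all fits; the difference debated above is only which thread it runs on.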
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155710752

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile ---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

The docker files in this PR are for reference only for now, and are by no means the official docker images moving forward. We will address concerns around the base image as part of SPARK-22647.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/19904 Can you share your test/results with us?
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user jason-dai commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155710219

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile ---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

I wonder if there are any follow-ups on this particular issue (apache-spark-on-k8s#326).
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84644/ Test PASSed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/19717

> Yes, I agree that we eventually should allow client mode including use cases that directly create SparkContext

Sorry, in your previous comment it seemed like you were suggesting disabling it entirely, and that had me concerned. As a separate issue, we may in the future want a nicer error message for an inaccessible in-cluster service (`kubernetes.default.svc`). That actually has no dependency on the driver service or its creation - we can check for that, and it should exist anyway. Disabling client mode as you did in the latest commit works for this particular scenario, I think.
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Merged build finished. Test PASSed.
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525

**[Test build #84644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84644/testReport)** for PR 19525 at commit [`33e08ee`](https://github.com/apache/spark/commit/33e08ee22f2e37f47da246ce197f699351a3bc1b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/19920#discussion_r155708117

--- Diff: core/src/main/scala/org/apache/spark/status/api/v1/api.scala ---
@@ -24,27 +24,29 @@ import com.fasterxml.jackson.databind.annotation.JsonDeserialize

 import org.apache.spark.JobExecutionStatus

-class ApplicationInfo private[spark](
-    val id: String,
-    val name: String,
-    val coresGranted: Option[Int],
-    val maxCores: Option[Int],
-    val coresPerExecutor: Option[Int],
-    val memoryPerExecutorMB: Option[Int],
-    val attempts: Seq[ApplicationAttemptInfo])
+case class ApplicationInfo private[spark](
+    id: String,
+    name: String,
+    coresGranted: Option[Int],
+    maxCores: Option[Int],
+    coresPerExecutor: Option[Int],
+    memoryPerExecutorMB: Option[Int],
+    attempts: Seq[ApplicationAttemptInfo]) {
+
--- End diff --

The empty braces can be removed here, but I think they should be kept if a new method `isCompleted` is added here.
[GitHub] spark pull request #19920: [SPARK-21672][CORE] Remove SHS-specific applicati...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/19920#discussion_r155708779

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala ---
@@ -88,4 +90,8 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
   private def makePageLink(showIncomplete: Boolean): String = {
     UIUtils.prependBaseUri("/?" + "showIncomplete=" + showIncomplete)
  }
+
+  private def isApplicationCompleted(appInfo: ApplicationInfo): Boolean = {
--- End diff --

The function `isApplicationCompleted` is defined 3 times in this patch. I suggest adding a method `isCompleted` in class `ApplicationInfo`.
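For illustration, the suggested refactor can be sketched with a Python dataclass stand-in for the Scala case class. The field names match the diff, but the "most recent attempt finished" rule is an assumption about the completion logic, not the actual SHS implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ApplicationAttemptInfo:
    completed: bool

@dataclass
class ApplicationInfo:
    id: str
    name: str
    attempts: List[ApplicationAttemptInfo] = field(default_factory=list)

    def is_completed(self) -> bool:
        # Assumed rule: the application counts as complete when its most
        # recent attempt has finished; real SHS logic may differ.
        return bool(self.attempts) and self.attempts[-1].completed

app = ApplicationInfo("app-1", "demo", [ApplicationAttemptInfo(completed=True)])
print(app.is_completed())  # True
```

A single method like this is what would replace the three duplicated `isApplicationCompleted` helpers, at the cost of widening the public API, which is the trade-off discussed in the thread.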
[GitHub] spark pull request #19924: [SPARK-22187][SS][REVERT] Revert change in state ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19924
[GitHub] spark issue #19924: [SPARK-22187][SS][REVERT] Revert change in state row for...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19924 Thanks. Merging to master.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717

> I think we should move the headless service creation into the backend code - anything essential for the backend to run shouldn't depend on the submission client/steps.

I agree that service creation should be done by the scheduler backend code, so as to get rid of the dependency on the submission client.

> Keeping that possible will let us plug Jupyter/spark-shell (running in-cluster for now). Disabling it completely will create an unnecessary dependency on spark-submit which IMO is undesirable. We do want people to be able to programmatically construct a spark context that can point at a k8s cluster in their code I think. kubernetes.default.svc is the in-cluster kube-dns provisioned DNS address that should point to the API server - its availability is a good indicator that we're in a place that can address other pods - so, that can be used to detect when we don't want to let users try and fail client mode.

Yes, I agree that we eventually should allow client mode, including use cases that directly create a `SparkContext`. But until we have a solid, well-tested solution, I think we should disable it for now. We can always revisit this once we have a good solution. Regarding `kubernetes.default.svc`: yes, it's a good indication. But again, the driver service must exist. Unless we change to have the backend create that service, this still won't work.
[GitHub] spark pull request #19884: [WIP][SPARK-22324][SQL][PYTHON] Upgrade Arrow to ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19884#discussion_r155708588

--- Diff: pom.xml ---
@@ -185,7 +185,7 @@
     2.8
     1.8
     1.0.0
-    0.4.0
+    0.8.0-SNAPSHOT
--- End diff --

Can we download the snapshot from somewhere for our local tests?
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717 **[Test build #84646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84646/testReport)** for PR 19717 at commit [`44c40b1`](https://github.com/apache/spark/commit/44c40b1b7d546e4d214aff78c282c768aa0e3d42).
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19843 Merged build finished. Test PASSed.
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19843 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84637/ Test PASSed.
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/19717

@liyinan926:

> Actually given that in our implementations the executors talk to the driver through a Kubernetes headless service that is only created by the submission client. None of the two cases above would work without careful and tricky configuration hacks.

I think we should move the headless service creation into the backend code - anything essential for the backend to run shouldn't depend on the submission client/steps. cc/ @mccheah for comment

> So creating a SparkContext directly with a k8s url should not be allowed until we find a way to make it work.

Keeping that possible will let us plug in Jupyter/spark-shell (running in-cluster for now). Disabling it completely will create an unnecessary dependency on spark-submit, which IMO is undesirable. We do want people to be able to programmatically construct a Spark context in their code that can point at a k8s cluster, I think. `kubernetes.default.svc` is the in-cluster, kube-dns-provisioned DNS address that should point to the API server. Its availability is a good indicator that we're in a place that can address other pods, so it can be used to detect when we should not let users try client mode only to fail.
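The detection heuristic described here, treating resolvability of `kubernetes.default.svc` as "we are inside a pod", can be sketched as follows. This is an illustration only, not the PR's implementation, and the function name is made up:

```python
import socket

def probably_in_cluster(host: str = "kubernetes.default.svc") -> bool:
    # kube-dns provisions this name inside every pod; if it resolves,
    # we can likely address other pods, so client mode might be viable.
    try:
        socket.gethostbyname(host)
        return True
    except OSError:
        return False

# Outside a Kubernetes pod this normally returns False.
print(probably_in_cluster())
```

As noted in the thread, this only indicates reachability of the API server's DNS name; it says nothing about whether the driver's headless service exists.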
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843

**[Test build #84637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84637/testReport)** for PR 19843 at commit [`9629b31`](https://github.com/apache/spark/commit/9629b31e577a40f71d660d22f48b0b59489a3471).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user yangw1234 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r155707713

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile ---
@@ -0,0 +1,47 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

The base image **openjdk:8-alpine** uses **musl libc** instead of **glibc**, so certain libraries (like Intel MKL for native BLAS support in mllib) cannot run in the container. There are also reports of OpenJDK compatibility issues on Alpine (https://github.com/apache-spark-on-k8s/spark/issues/326). Should we change to a base image that supports glibc?
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 @vanzin I implemented the enforcement in https://github.com/apache/spark/pull/19717/commits/44c40b1b7d546e4d214aff78c282c768aa0e3d42. PTAL. Thanks!
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19852 Merged build finished. Test FAILed.
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19852 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84633/ Test FAILed.
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19852

**[Test build #84633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84633/testReport)** for PR 19852 at commit [`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).

* This patch **fails from timeout after a configured wait of `250m`**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14151 **[Test build #84645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84645/testReport)** for PR 14151 at commit [`7e91020`](https://github.com/apache/spark/commit/7e9102040ee9d6817288b5712e4c3c353390fba1).
[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/14151#discussion_r155706338

--- Diff: python/pyspark/sql/readwriter.py ---
@@ -313,11 +313,16 @@ def text(self, paths):
         Each line in the text file is a new row in the resulting DataFrame.

         :param paths: string, or list of strings, for input path(s).
+        :param wholetext: if true, read each file from input path(s) as a single row.

         >>> df = spark.read.text('python/test_support/sql/text-test.txt')
         >>> df.collect()
         [Row(value=u'hello'), Row(value=u'this')]
+        >>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
+        >>> df.collect()
+        [Row(value=u'hello\nthis')]
--- End diff --

That would fail the test, I suppose. I can give that a try though.
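The worry about the literal `\n` in the doctest can be sidestepped with a raw docstring, so the two characters backslash and `n` survive verbatim and match the repr that doctest compares against. A quick self-contained sketch (the function name is made up; this is not the PySpark code):

```python
import doctest

def read_wholetext_demo():
    r"""Return a two-line string, like a wholetext read of a two-line file.

    Because the docstring is raw (r-prefix), ``\n`` below stays two literal
    characters and matches the repr that doctest prints:

    >>> read_wholetext_demo()
    'hello\nthis'
    """
    return "hello\nthis"

failures = doctest.testmod().failed
print(failures)  # 0
```

Without the `r` prefix, the `\n` in the expected output would become a real newline inside the docstring and the doctest would fail to match.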
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155706151 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -363,9 +370,25 @@ class ParquetFileFormat fileSplit.getLocations, null) + val sharedConf = broadcastedHadoopConf.value.value + // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps' + // *only* if the file was created by something other than "parquet-mr", so check the actual + // writer here for this file. We have to do this per-file, as each file in the table may + // have different writers. + def isCreatedByParquetMr(): Boolean = { --- End diff -- Ah, that makes sense. Thanks!
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155705887 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -363,9 +370,25 @@ class ParquetFileFormat fileSplit.getLocations, null) + val sharedConf = broadcastedHadoopConf.value.value + // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps' + // *only* if the file was created by something other than "parquet-mr", so check the actual + // writer here for this file. We have to do this per-file, as each file in the table may + // have different writers. + def isCreatedByParquetMr(): Boolean = { --- End diff -- I was thinking the same thing, but then I realised this is intentionally a `def`, for short-circuiting `timestampConversion && !isCreatedByParquetMr()` at L383.
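The distinction being discussed can be shown with a minimal standalone Scala sketch (hypothetical names, not Spark's actual code): a `val` evaluates its right-hand side eagerly at the point of definition, while a `def` is only evaluated when invoked, so `&&` can skip the expensive footer read entirely.

```scala
object ShortCircuitSketch {
  def main(args: Array[String]): Unit = {
    var footerReads = 0

    // `def`: the body runs only if something actually calls it.
    def isCreatedByParquetMr(): Boolean = { footerReads += 1; true }

    val timestampConversion = false
    // Short-circuits: isCreatedByParquetMr() is never invoked.
    val needsConversion = timestampConversion && !isCreatedByParquetMr()
    println(s"needsConversion=$needsConversion footerReads=$footerReads")
    // prints: needsConversion=false footerReads=0

    // A `val` would pay the cost immediately, whether or not it is needed:
    val eagerCheck = { footerReads += 1; true }
    println(s"eagerCheck=$eagerCheck footerReads=$footerReads")
    // prints: eagerCheck=true footerReads=1
  }
}
```

A `lazy val` is a third option: it also defers the work, but caches the result after the first use, which matters when the value may be needed more than once.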
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/19717 Checking the presence of `kubernetes.default.svc` is not sufficient, as the driver service must exist. When the submission client is not involved, the driver itself must create that service before requesting the executor pods. I think we can go with an env var flag as @vanzin suggested.
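A sketch of the env-var approach being suggested (the flag name `SPARK_ROLE_DRIVER` below is purely hypothetical, not something the PR defines): whatever creates the driver pod sets an explicit flag on it, and the driver branches on that flag rather than probing for `kubernetes.default.svc`.

```scala
object KubernetesRoleDetector {
  // Hypothetical env var; assumed to be set on the driver pod spec at submission time.
  private val DriverFlag = "SPARK_ROLE_DRIVER"

  /** True if this JVM was launched as a Kubernetes driver pod, per the flag. */
  def isKubernetesDriver(env: Map[String, String] = sys.env): Boolean =
    env.get(DriverFlag).exists(_.equalsIgnoreCase("true"))
}
```

Taking the environment as a parameter keeps the check unit-testable; probing DNS for `kubernetes.default.svc` can succeed even when the driver service has not been created yet, which is exactly the false positive described above.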
[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19926 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84635/ Test PASSed.
[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19926 Merged build finished. Test PASSed.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84634/ Test FAILed.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19769 Merged build finished. Test FAILed.
[GitHub] spark issue #19769: [SPARK-12297][SQL] Adjust timezone for int96 data from i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19769 **[Test build #84634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84634/testReport)** for PR 19769 at commit [`1185280`](https://github.com/apache/spark/commit/11852808e9fab0efa0ca7a4548e8ea68c42b6266). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19926: [SPARK-22733] Split StreamExecution into MicroBatchExecu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19926 **[Test build #84635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84635/testReport)** for PR 19926 at commit [`9f4daf6`](https://github.com/apache/spark/commit/9f4daf6b3eaeb413b803c3d0ec2ce0f43dc556a5). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155704230 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java --- @@ -105,10 +112,19 @@ */ private final MemoryMode MEMORY_MODE; - public VectorizedParquetRecordReader(boolean useOffHeap) { + public VectorizedParquetRecordReader(TimeZone convertTz, boolean useOffHeap) { +this.convertTz = convertTz; MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP; } + public VectorizedParquetRecordReader(boolean useOffHeap) { +this(null, useOffHeap); + } + + VectorizedParquetRecordReader() { --- End diff -- What is this constructor for?
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155703789 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -363,9 +370,25 @@ class ParquetFileFormat fileSplit.getLocations, null) + val sharedConf = broadcastedHadoopConf.value.value + // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps' + // *only* if the file was created by something other than "parquet-mr", so check the actual + // writer here for this file. We have to do this per-file, as each file in the table may + // have different writers. + def isCreatedByParquetMr(): Boolean = { --- End diff -- nit: Do we need to define this as `def` instead of `val`?
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155703855 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -363,9 +370,25 @@ class ParquetFileFormat fileSplit.getLocations, null) + val sharedConf = broadcastedHadoopConf.value.value + // PARQUET_INT96_TIMESTAMP_CONVERSION says to apply timezone conversions to int96 timestamps' + // *only* if the file was created by something other than "parquet-mr", so check the actual + // writer here for this file. We have to do this per-file, as each file in the table may + // have different writers. + def isCreatedByParquetMr(): Boolean = { +val footer = ParquetFileReader.readFooter(sharedConf, fileSplit.getPath, SKIP_ROW_GROUPS) +footer.getFileMetaData().getCreatedBy().startsWith("parquet-mr") + } + val convertTz = +if (timestampConversion && !isCreatedByParquetMr()) { + Some(DateTimeUtils.getTimeZone(sharedConf.get(SQLConf.SESSION_LOCAL_TIMEZONE.key))) +} else { + None +} + val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) val hadoopAttemptContext = -new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId) +new TaskAttemptContextImpl(broadcastedHadoopConf.value.value, attemptId); --- End diff -- nit: we don't need `;` at the end of this line.
[GitHub] spark pull request #19769: [SPARK-12297][SQL] Adjust timezone for int96 data...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19769#discussion_r155704317 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -307,6 +308,10 @@ class ParquetFileFormat hadoopConf.set( ParquetWriteSupport.SPARK_ROW_SCHEMA, requiredSchema.json) +hadoopConf.set( + SQLConf.SESSION_LOCAL_TIMEZONE.key, + sparkSession.sessionState.conf.sessionLocalTimeZone +) --- End diff -- nit: we can join this line to the previous line.
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525 **[Test build #84644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84644/testReport)** for PR 19525 at commit [`33e08ee`](https://github.com/apache/spark/commit/33e08ee22f2e37f47da246ce197f699351a3bc1b).
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84638/ Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test FAILed.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84638/testReport)** for PR 19811 at commit [`2ea1d35`](https://github.com/apache/spark/commit/2ea1d35972b8b6a5b6ca7a74bf7e8ceccc857536). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19852 **[Test build #84643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84643/testReport)** for PR 19852 at commit [`f606951`](https://github.com/apache/spark/commit/f606951b2fe2f74a0aa95055e17e479499c82e03).
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19852 retest this please
[GitHub] spark issue #19872: WIP: [SPARK-22274][PySpark] User-defined aggregation fun...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19872 @icexelloss I'm sorry for the late response. Actually, I tried to implement prototypes of Pandas UDAF with partial aggregation by combining existing aggregate functions, but they are still quite complicated (https://github.com/ueshin/apache-spark/pull/2, https://github.com/ueshin/apache-spark/pull/3, https://github.com/ueshin/apache-spark/pull/4). I was thinking about an easier way to achieve that, but haven't found one yet. I haven't looked into this PR yet, but I guess we can start with this PR and pick some functionality from my prototypes if needed.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19927 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84640/ Test PASSed.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19927 Merged build finished. Test PASSed.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19927 **[Test build #84640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84640/testReport)** for PR 19927 at commit [`b95795f`](https://github.com/apache/spark/commit/b95795fefce3dcb0bc3b91cbf70f09d7747affd1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19852: [SPARK-22655][PySpark] Throw exception rather tha...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19852#discussion_r155701675 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -317,10 +317,6 @@ private[spark] abstract class BasePythonRunner[IN, OUT]( logDebug("Exception thrown after task interruption", e) throw new TaskKilledException(context.getKillReason().getOrElse("unknown reason")) - case e: Exception if env.isStopped => --- End diff -- Note to myself: this was added in https://github.com/apache/spark/pull/2838
[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19591 Merged build finished. Test PASSed.
[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19591 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84629/ Test PASSed.
[GitHub] spark issue #19591: [SPARK-11035][core] Add in-process Spark app launcher.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19591 **[Test build #84629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84629/testReport)** for PR 19591 at commit [`8496024`](https://github.com/apache/spark/commit/84960244458045810f0066256fbbee2446a3071c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19927 test code: ``` import org.apache.spark.ml.classification._ val df = spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt") val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true) val ovr = new OneVsRest().setClassifier(classifier) var df2 = df (0 until 10).foreach{ i => df2 = df2.union(df2) } df2 = df2.repartition(64) df2.persist() df2.count Seq(3, 5, 7, 10, 20).foreach { k => val dfk = df2.withColumn("label", (rand()*k).cast("int")) val ovrModelK = ovr.fit(dfk) val start = System.nanoTime (0 until 10).foreach { i => ovrModelK.transform(dfk).count } val end = System.nanoTime val duration = (end - start) / 1e9 println(s"numClasses $k, duration $duration") } ```
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Merged build finished. Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19904 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84639/ Test PASSed.
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)** for PR 19904 at commit [`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84642/testReport)** for PR 19813 at commit [`9443011`](https://github.com/apache/spark/commit/9443011978c32c611e950a6193f05aa666437f50).
[GitHub] spark issue #19852: [SPARK-22655][PySpark] Throw exception rather than exit ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19852 LGTM
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84641/testReport)** for PR 19813 at commit [`7fc3e04`](https://github.com/apache/spark/commit/7fc3e04b9894214e67227beeb14165610e4887d2).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r155698248 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala --- @@ -0,0 +1,237 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.codegen + +import scala.collection.mutable + +import org.apache.spark.sql.catalyst.expressions._ + +/** + * Defines util methods used in expression code generation. + */ +object ExpressionCodegen { + + /** + * Given an expression, returns all the necessary parameters to evaluate it, so the generated + * code of this expression can be split in a function. + * The 1st string in returned tuple is the parameter strings used to call the function. + * The 2nd string in returned tuple is the parameter strings used to declare the function. + * + * Returns `None` if it can't produce valid parameters. + * + * Params to include: + * 1. Evaluated columns referred by this, children or deferred expressions. + * 2. Rows referred by this, children or deferred expressions. + * 3. Eliminated subexpressions referred by children expressions.
+ */ + def getExpressionInputParams( + ctx: CodegenContext, + expr: Expression): Option[(Seq[String], Seq[String])] = { +val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr) +val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars) + +val subExprs = getSubExprInChildren(ctx, expr) +val subExprCodes = getSubExprCodes(ctx, subExprs) +val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes) + +val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr) +val paramsFromRows = inputRows.distinct.filter(_ != null).map { row => + (row, s"InternalRow $row") +} + +val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length +// Maximum allowed parameter number for Java's method descriptor. +if (paramsLength > 255) { --- End diff -- Since we split functions out once the generated code gets long enough, I don't think we can realistically hit this limit. This is more like a safety guard.
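For background on the `paramsLength > 255` guard in the diff: a JVM method descriptor is limited to 255 parameter slots, where `long` and `double` occupy two slots each and `this` takes one for instance methods. A simplified, self-contained sketch of such a guard (the `Param` type and slot mapping here are illustrative assumptions, not the PR's actual helpers):

```scala
// Sketch of a parameter-count guard, not the PR's real implementation.
// `javaType` stands in for however an expression's JVM-level type is obtained.
case class Param(javaType: String)

object ParamGuard {
  private val MaxSlots = 255 // JVM method descriptor limit (JVMS §4.3.3)

  /** Slots a parameter occupies in a method descriptor: long/double take two. */
  private def slots(p: Param): Int = p.javaType match {
    case "long" | "double" => 2
    case _                 => 1
  }

  /** Returns the params only if the split-out method would be callable on the JVM. */
  def checkedParams(params: Seq[Param]): Option[Seq[Param]] = {
    val total = params.map(slots).sum
    if (total > MaxSlots) None else Some(params)
  }
}
```

Counting slots rather than parameters matters: 130 `long` parameters already exceed the limit even though the parameter count itself is well under 255.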
[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19923 Merged build finished. Test PASSed.
[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19923 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84627/ Test PASSed.
[GitHub] spark issue #19923: [SPARK-22721] BytesToBytesMap peak memory not updated.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19923 **[Test build #84627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84627/testReport)** for PR 19923 at commit [`12de708`](https://github.com/apache/spark/commit/12de708a3ba894b9568be12402f561b767355acc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19927 **[Test build #84640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84640/testReport)** for PR 19927 at commit [`b95795f`](https://github.com/apache/spark/commit/b95795fefce3dcb0bc3b91cbf70f09d7747affd1).
[GitHub] spark issue #18995: [SPARK-21787][SQL] Support for pushing down filters for ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18995 Thank you so much, @cloud-fan !
[GitHub] spark issue #19499: [SPARK-22279][SQL] Turn on spark.sql.hive.convertMetasto...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19499 Thank you so much, @gatorsmile !
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19927 Merged build finished. Test FAILed.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19927 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84636/ Test FAILed.
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19927 **[Test build #84636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84636/testReport)** for PR 19927 at commit [`45f70c8`](https://github.com/apache/spark/commit/45f70c87822b7c4bd872c0729509eeed0a2f9441). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/19904#discussion_r155695280 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala --- @@ -146,31 +146,34 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String) val validationDataset = sparkSession.createDataFrame(validation, schema).cache() logDebug(s"Train split $splitIndex with multiple sets of parameters.") + var completeFitCount = 0 + val signal = new Object // Fit models in a Future for training in parallel - val modelFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) => -Future[Model[_]] { + val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) => +Future[Double] { val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]] + signal.synchronized { +completeFitCount += 1 +signal.notify() + } if (collectSubModelsParam) { subModels.get(splitIndex)(paramIndex) = model } - model -} (executionContext) - } - - // Unpersist training data only when all models have trained - Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext) -.onComplete { _ => trainingDataset.unpersist() } (executionContext) - - // Evaluate models in a Future that will calulate a metric and allow model to be cleaned up - val foldMetricFutures = modelFutures.zip(epm).map { case (modelFuture, paramMap) => -modelFuture.map { model => // TODO: duplicate evaluator to take extra params from input val metric = eval.evaluate(model.transform(validationDataset, paramMap)) logDebug(s"Got metric $metric for model trained with $paramMap.") metric } (executionContext) } + Future { +signal.synchronized { + while (completeFitCount < epm.length) { --- End diff -- Ah! Your idea is really good. We don't need any "wait thread" to do this unpersist. Your solution will be simpler.
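The simpler structure agreed on above can be sketched as follows (a hedged outline with a placeholder `fitAndEvaluate`, not the PR's final code): each future fits and immediately evaluates one param map, so the model can be garbage-collected as soon as its metric is computed, and `Future.sequence(...).onComplete` unpersists the training data once every future is done, with no dedicated wait thread or completion counter.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object CrossValidationSketch {
  // Placeholder for "fit a model on this param map, then evaluate it";
  // assumed for illustration, not Spark's API.
  def fitAndEvaluate(paramIndex: Int): Double = paramIndex.toDouble

  def foldMetrics(numParamMaps: Int, unpersistTraining: () => Unit)
                 (implicit ec: ExecutionContext): Seq[Double] = {
    // One future per param map: train, then immediately evaluate, letting the
    // model go out of scope instead of being held until every fit has finished.
    val metricFutures = (0 until numParamMaps).map { i =>
      Future { fitAndEvaluate(i) }
    }
    // Unpersist training data only after all futures have completed.
    Future.sequence(metricFutures).onComplete { _ => unpersistTraining() }
    metricFutures.map(Await.result(_, Duration.Inf))
  }
}
```

The key memory difference from the original code is that no collection of `Future[Model[_]]` keeps every trained model reachable until the end of the fold.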
[GitHub] spark issue #19904: [SPARK-22707][ML] Optimize CrossValidator memory occupat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19904 **[Test build #84639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84639/testReport)** for PR 19904 at commit [`2cc7c28`](https://github.com/apache/spark/commit/2cc7c28f385009570536690d686f2843485942b2).
[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19921 Merged build finished. Test PASSed.
[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19921 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84626/ Test PASSed.
[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19921 **[Test build #84626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84626/testReport)** for PR 19921 at commit [`2d4c2b0`](https://github.com/apache/spark/commit/2d4c2b0bdf89badf25f2d1d98903125e48e7cd5c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19919: [SPARK-22727] spark.executor.instances's default value s...
Github user liu-zhaokun commented on the issue: https://github.com/apache/spark/pull/19919 @srowen If the default is not correct, I can fix it in this PR by the way.
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r155693977

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala ---

@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that should be propagated to
+   *         the data source.
+   */
+  def withSessionConfig(
+      cs: ConfigSupport,
+      source: String,
+      conf: SQLConf): immutable.Map[String, String] = {
+    val prefixes = cs.getConfigPrefixes
+    require(prefixes != null, "The config key-prefixes cann't be null.")
+    val mapping = cs.getConfigMapping.asScala
+    val validOptions = cs.getValidOptions
+    require(validOptions != null, "The valid options list cann't be null.")

--- End diff --

double n
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r155693966

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ConfigSupport.scala ---

@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import java.util.regex.Pattern
+
+import scala.collection.JavaConverters._
+import scala.collection.immutable
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.v2.ConfigSupport
+
+private[sql] object DataSourceV2ConfigSupport extends Logging {
+
+  /**
+   * Helper method to propagate session configs with config key that matches at least one of the
+   * given prefixes to the corresponding data source options.
+   *
+   * @param cs the session config propagate help class
+   * @param source the data source format
+   * @param conf the session conf
+   * @return an immutable map that contains all the session configs that should be propagated to
+   *         the data source.
+   */
+  def withSessionConfig(
+      cs: ConfigSupport,
+      source: String,
+      conf: SQLConf): immutable.Map[String, String] = {
+    val prefixes = cs.getConfigPrefixes
+    require(prefixes != null, "The config key-prefixes cann't be null.")

--- End diff --

double n
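Going by the Javadoc in the diff above, the core of `withSessionConfig` is: keep the session configs whose keys match one of the registered prefixes, then optionally rename them through a mapping before handing them to the data source. A simplified, hedged sketch over plain Maps (the signature and names here are stand-ins, not the actual `SQLConf`/`ConfigSupport` API):

```scala
// Simplified stand-in for withSessionConfig: `sessionConf` replaces SQLConf,
// and `prefixes`/`mapping` replace the ConfigSupport accessors in the diff.
def propagateSessionConfigs(
    sessionConf: Map[String, String],
    prefixes: Seq[String],
    mapping: Map[String, String]): Map[String, String] = {
  require(prefixes != null, "The config key prefixes can't be null.")
  sessionConf
    .filter { case (key, _) => prefixes.exists(key.startsWith) }       // keep matching keys
    .map { case (key, value) => mapping.getOrElse(key, key) -> value } // optionally rename
}
```

For example, with the hypothetical prefix `spark.datasource.foo.`, only keys under that prefix survive, and a mapped key is emitted under its short option name.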
[GitHub] spark issue #19919: [SPARK-22727] spark.executor.instances's default value s...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19919 Hm, I thought the default would appear in `.../config/package.scala`, but I don't see it. I'm not actually sure it defaults to 2 anywhere(?). (@vanzin does this ring a bell?) I don't think these changes should be made, but I'm also wondering if the docs are correct that the default is 2.
[GitHub] spark issue #19811: [SPARK-18016][SQL] Code Generation: Constant Pool Limit ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84638/testReport)** for PR 19811 at commit [`2ea1d35`](https://github.com/apache/spark/commit/2ea1d35972b8b6a5b6ca7a74bf7e8ceccc857536).
[GitHub] spark pull request #19904: [SPARK-22707][ML] Optimize CrossValidator memory ...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/19904#discussion_r155693144

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---

@@ -146,31 +146,34 @@ class CrossValidator @Since("1.2.0") (@Since("1.4.0") override val uid: String)
       val validationDataset = sparkSession.createDataFrame(validation, schema).cache()
       logDebug(s"Train split $splitIndex with multiple sets of parameters.")

+      var completeFitCount = 0
+      val signal = new Object
       // Fit models in a Future for training in parallel
-      val modelFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
-        Future[Model[_]] {
+      val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
+        Future[Double] {
           val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
+          signal.synchronized {
+            completeFitCount += 1
+            signal.notify()
+          }
           if (collectSubModelsParam) {
             subModels.get(splitIndex)(paramIndex) = model
           }
-          model
-        } (executionContext)
-      }
-
-      // Unpersist training data only when all models have trained
-      Future.sequence[Model[_], Iterable](modelFutures)(implicitly, executionContext)
-        .onComplete { _ => trainingDataset.unpersist() } (executionContext)
-
-      // Evaluate models in a Future that will calculate a metric and allow model to be cleaned up
-      val foldMetricFutures = modelFutures.zip(epm).map { case (modelFuture, paramMap) =>
-        modelFuture.map { model =>
           // TODO: duplicate evaluator to take extra params from input
           val metric = eval.evaluate(model.transform(validationDataset, paramMap))
           logDebug(s"Got metric $metric for model trained with $paramMap.")
           metric
         } (executionContext)
       }
+      Future {
+        signal.synchronized {
+          while (completeFitCount < epm.length) {

--- End diff --

Sorry, I'm not too familiar with Futures in Scala. Is it safe to create a blocking future like this? Do you risk starving the thread pool? Can we just use an if statement in the `synchronized` block above? Something like:
```
completeFitCount += 1
if (completeFitCount == epm.length) {
  trainingDataset.unpersist()
}
```
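The if-statement variant suggested above lets the last fit to finish perform the cleanup inside the same `synchronized` block, so no future ever blocks waiting. A hedged sketch outside Spark; `runAndCleanUp`, the squared-value "fit" work, and the boolean flag are illustrative stand-ins for `est.fit(...)` and `trainingDataset.unpersist()`:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Illustrative names only; the flag stands in for trainingDataset.unpersist().
def runAndCleanUp(nTasks: Int): (Int, Boolean) = {
  val pool = Executors.newFixedThreadPool(4)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  var completed = 0                // guarded by `lock`
  @volatile var cleanedUp = false
  val lock = new Object

  val futures = (1 to nTasks).map { i =>
    Future {
      val result = i * i           // fake "fit" work
      lock.synchronized {
        completed += 1
        if (completed == nTasks) { // the last finisher cleans up; nobody blocks
          cleanedUp = true
        }
      }
      result
    }
  }
  val total = futures.map(Await.result(_, Duration.Inf)).sum
  pool.shutdown()
  (total, cleanedUp)
}
```

Unlike the monitor-plus-wait-thread version in the diff, this needs no extra thread and cannot starve a small fixed-size pool.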
[GitHub] spark issue #19843: [SPARK-22644][ML][TEST][WIP] Make ML testsuite support S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19843 **[Test build #84637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84637/testReport)** for PR 19843 at commit [`9629b31`](https://github.com/apache/spark/commit/9629b31e577a40f71d660d22f48b0b59489a3471).
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155692778

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---

@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),

--- End diff --

That may need a special code path for boolean type, but IMHO I don't think it deserves the complexity.
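The disagreement above is over what the surviving column stats should look like after filtering on `cbool > false`: only `true` rows pass, so `distinctCount` drops to 1, but narrowing `min` to `true` would need the boolean-specific code path wzhfy considers not worth the complexity. A toy model of the two candidate answers (not Spark's `ColumnStat`; the names here are hypothetical):

```scala
// Hypothetical stand-in for the boolean column statistics under discussion.
case class BoolColumnStat(distinctCount: Int, min: Boolean, max: Boolean)

// Stats update after the predicate `col > false`; only `true` satisfies it.
// `narrowMin` selects between the two behaviors debated in the review.
def afterGreaterThanFalse(stat: BoolColumnStat, narrowMin: Boolean): BoolColumnStat =
  if (narrowMin) stat.copy(distinctCount = 1, min = true, max = true)
  else stat.copy(distinctCount = 1, max = true) // keep the column's original min
```

The diff changes the test's expectation from the first behavior to the second.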
[GitHub] spark issue #19927: [WIP] OVR transform optimization
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19927 **[Test build #84636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84636/testReport)** for PR 19927 at commit [`45f70c8`](https://github.com/apache/spark/commit/45f70c87822b7c4bd872c0729509eeed0a2f9441).