[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
zhengruifeng commented on PR #40402: URL: https://github.com/apache/spark/pull/40402#issuecomment-1467212310

also cc @WeichenXu123 since this PR supports `df.collect` with UDT
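This thread concerns `df.collect` returning UDT values over Spark Connect. A minimal PySpark sketch of what that enables (illustrative only, not code from the PR; assumes a local Connect server at `sc://localhost` and uses the built-in `VectorUDT` that backs `pyspark.ml.linalg` vectors):

```python
# Illustrative sketch (not from the PR): collecting a UDT-typed column over
# Spark Connect. VectorUDT is the built-in UDT behind pyspark.ml.linalg.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
df = spark.createDataFrame([(1, Vectors.dense([1.0, 2.0]))], ["id", "features"])

rows = df.collect()
# With UDT support, the collected value is a DenseVector, not its raw struct form.
print(rows[0].features)  # DenseVector([1.0, 2.0])
```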
[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
panbingkun commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134807339

## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala

```diff
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
+class HiveGenericUDFHelper(
+    funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =
```

Review Comment: If we do not define the `dataType`, `foldable`, and `deterministic` properties in this class, we will have to expose some internal logic, such as `returnInspector` (screenshot: https://user-images.githubusercontent.com/15246973/224871989-7be999b4-c229-44d1-9e53-658ecacc539c.png). Given the similar implementation of HiveSimpleUDF, I found some common properties, so I can do some abstraction and reuse them later.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134836246

## mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala

```diff
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+    getParallelism match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(ThreadUtils
+          .newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+    }
+  }
```

Review Comment: it will make `mllib-common` directly depend on Spark core. Let me take another look.
[GitHub] [spark] sadikovi commented on a diff in pull request #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true
sadikovi commented on code in PR #40396: URL: https://github.com/apache/spark/pull/40396#discussion_r1134891054

## connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala

```diff
@@ -49,10 +49,6 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCT
   override def sparkConf: SparkConf = super.sparkConf
     .set("spark.sql.catalog.postgresql", classOf[JDBCTableCatalog].getName)
     .set("spark.sql.catalog.postgresql.url", db.getJdbcUrl(dockerIp, externalPort))
-    .set("spark.sql.catalog.postgresql.pushDownTableSample", "true")
```

Review Comment: IMHO, I would still keep those configs in the tests; it makes it clear that the configs need to be enabled in the test. But it is fine either way for me.
[GitHub] [spark] itholic commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list
itholic commented on PR #40282: URL: https://github.com/apache/spark/pull/40282#issuecomment-1467350642

Reminder for @HyukjinKwon @srielau @MaxGekk for error class document for PySpark.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
dongjoon-hyun commented on code in PR #40395: URL: https://github.com/apache/spark/pull/40395#discussion_r1134938946

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala

```diff
@@ -130,9 +131,15 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with BeforeAndAfterAll {
     testImplicit(BigDecimal(decimal))
     testImplicit(Date.valueOf(LocalDate.now()))
     testImplicit(LocalDate.now())
-    testImplicit(LocalDateTime.now())
-    testImplicit(Instant.now())
-    testImplicit(Timestamp.from(Instant.now()))
+    // SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always
+    // get microseconds on both Linux and MacOS, but there are some differences when
+    // using Java 17, it will get accurate nanoseconds on Linux, but still get the microseconds
+    // on MacOS. At present, Spark always converts them to microseconds, this will cause the
+    // test fail when using Java 17 on Linux, so add `truncatedTo(ChronoUnit.MICROS)` to
+    // there to ensure the accuracy of input data is microseconds.
+    testImplicit(LocalDateTime.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Instant.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Timestamp.from(Instant.now().truncatedTo(ChronoUnit.MICROS)))
```

Review Comment: Shall we apply this change for Java 17 only?
[GitHub] [spark] LuciferYang commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
LuciferYang commented on code in PR #40395: URL: https://github.com/apache/spark/pull/40395#discussion_r1135005783

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala

```diff
@@ -130,9 +132,21 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with BeforeAndAfterAll {
     testImplicit(BigDecimal(decimal))
     testImplicit(Date.valueOf(LocalDate.now()))
     testImplicit(LocalDate.now())
-    testImplicit(LocalDateTime.now())
-    testImplicit(Instant.now())
-    testImplicit(Timestamp.from(Instant.now()))
+    // SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always
+    // get microseconds on both Linux and MacOS, but there are some differences when
+    // using Java 17, it will get accurate nanoseconds on Linux, but still get the microseconds
+    // on MacOS. At present, Spark always converts them to microseconds, this will cause the
+    // test fail when using Java 17 on Linux, so add `truncatedTo(ChronoUnit.MICROS)` when
+    // testing on Linux using Java 17 to ensure the accuracy of input data is microseconds.
+    if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) && SystemUtils.IS_OS_LINUX) {
```

Review Comment: Checked CentOS+Java 17, it works.
[GitHub] [spark] cloud-fan commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2
cloud-fan commented on PR #18990: URL: https://github.com/apache/spark/pull/18990#issuecomment-1467186918

It should have been fixed in 3.2+: https://github.com/apache/spark/pull/37855
[GitHub] [spark] wangshengjie123 commented on pull request #40391: [SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers
wangshengjie123 commented on PR #40391: URL: https://github.com/apache/spark/pull/40391#issuecomment-1467201970

@Ngone51 @tgravescs Could you please help review this PR when you have time? Thanks.
[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
panbingkun commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134810496

## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala

```diff
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
+class HiveGenericUDFHelper(
+    funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =
+    isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]
+
+  @transient
+  private[hive] val dataType: DataType = inspectorToDataType(returnInspector)
+
+  @transient
+  private lazy val function = funcWrapper.createFunction[GenericUDF]()
+
+  @transient
+  private lazy val isUDFDeterministic = {
+    val udfType = function.getClass.getAnnotation(classOf[HiveUDFType])
+    udfType != null && udfType.deterministic() && !udfType.stateful()
+  }
+
+  @transient
+  private lazy val argumentInspectors = children.map(toInspector)
+
+  @transient
+  private lazy val deferredObjects: Array[DeferredObject] = argumentInspectors.zip(children).map {
```

Review Comment: Ok
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
dongjoon-hyun commented on code in PR #40395: URL: https://github.com/apache/spark/pull/40395#discussion_r1134941041

[diff context as in the earlier dongjoon-hyun comment on connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala]

Review Comment: Also, it would be great if we can apply new code on Linux environment only, @LuciferYang.
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134960384

## mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala

```diff
@@ -1,65 +0,0 @@
-/* ... ASF license header ... */
-
-package org.apache.spark.ml.attribute
-
-import org.apache.spark.SparkFunSuite
-
-class AttributeGroupSuite extends SparkFunSuite {
-
-  test("attribute group") {
-    val attrs = Array(
-      NumericAttribute.defaultAttr,
-      NominalAttribute.defaultAttr,
-      BinaryAttribute.defaultAttr.withIndex(0),
-      NumericAttribute.defaultAttr.withName("age").withSparsity(0.8),
-      NominalAttribute.defaultAttr.withName("size").withValues("small", "medium", "large"),
-      BinaryAttribute.defaultAttr.withName("clicked").withValues("no", "yes"),
-      NumericAttribute.defaultAttr,
-      NumericAttribute.defaultAttr)
-    val group = new AttributeGroup("user", attrs)
-    assert(group.size === 8)
-    assert(group.name === "user")
-    assert(group(0) === NumericAttribute.defaultAttr.withIndex(0))
-    assert(group(2) === BinaryAttribute.defaultAttr.withIndex(2))
-    assert(group.indexOf("age") === 3)
-    assert(group.indexOf("size") === 4)
-    assert(group.indexOf("clicked") === 5)
-    assert(!group.hasAttr("abc"))
-    intercept[NoSuchElementException] {
-      group("abc")
-    }
-    assert(group === AttributeGroup.fromMetadata(group.toMetadataImpl, group.name))
-    assert(group === AttributeGroup.fromStructField(group.toStructField()))
-  }
-
-  test("attribute group without attributes") {
-    val group0 = new AttributeGroup("user", 10)
-    assert(group0.name === "user")
-    assert(group0.numAttributes === Some(10))
-    assert(group0.size === 10)
-    assert(group0.attributes.isEmpty)
-    assert(group0 === AttributeGroup.fromMetadata(group0.toMetadataImpl, group0.name))
-    assert(group0 === AttributeGroup.fromStructField(group0.toStructField()))
-
-    val group1 = new AttributeGroup("item")
-    assert(group1.name === "item")
-    assert(group1.numAttributes.isEmpty)
-    assert(group1.attributes.isEmpty)
-    assert(group1.size === -1)
-  }
-}
```

Review Comment: Similar question for `AttributeSuite.scala`, `JsonMatrixConverterSuite.scala`, ..
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134959146

[diff context as in the previous comment: removal of mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala]

Review Comment: Q: This file should be moved to ml-common, but why does the git diff not show it as an "a --> b" move?
[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`
ueshin commented on PR #40388: URL: https://github.com/apache/spark/pull/40388#issuecomment-1467126454

Btw, what happens if `"PYSPARK_NO_NAMESPACE_SHARE" in os.environ`? https://github.com/apache/spark/blob/761e0c0f6f0d00733177a869b9ecdf454e13fc9f/python/pyspark/sql/utils.py#L148-L161

Usually, when the env var is set, we need to explicitly import `pyspark.sql.connect.functions`.
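The env-var question can be sketched as follows (a rough illustration of the import dispatch being discussed, assuming the semantics described in the comment; this is not the actual `pyspark.sql.utils` code):

```python
import os

# Sketch of the question raised above: when PYSPARK_NO_NAMESPACE_SHARE is set,
# pyspark.sql names are not shared with Spark Connect, so the Connect variant
# has to be imported explicitly.
if "PYSPARK_NO_NAMESPACE_SHARE" in os.environ:
    from pyspark.sql.connect.functions import pandas_udf  # explicit Connect import
else:
    from pyspark.sql.functions import pandas_udf  # may be redirected under Connect
```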
[GitHub] [spark] liang3zy22 closed pull request #40347: [SPARK-42711][BUILD]Update usage info and shellcheck warn/error fix for build/sbt tool
liang3zy22 closed pull request #40347: [SPARK-42711][BUILD]Update usage info and shellcheck warn/error fix for build/sbt tool
URL: https://github.com/apache/spark/pull/40347
[GitHub] [spark] liang3zy22 commented on pull request #40347: [SPARK-42711][BUILD]Update usage info and shellcheck warn/error fix for build/sbt tool
liang3zy22 commented on PR #40347: URL: https://github.com/apache/spark/pull/40347#issuecomment-1467133558

Yes, this PR is kind of useless. I'm closing it now.
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134787382

## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala

```diff
@@ -0,0 +1,97 @@
+/* ... ASF license header ... */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {
```

Review Comment: Similar question for Classifier, ProbabilisticClassifier
[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
cloud-fan commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134829857

## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala

```diff
@@ -191,18 +157,18 @@ private[hive] case class HiveGenericUDF(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     copy(children = newChildren)
 
-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+  protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val refTerm = ctx.addReferenceObj("this", this)
```

Review Comment: do we still reference the entire `this`? Can we just reference the new evaluator?
[GitHub] [spark] zhengruifeng closed pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
zhengruifeng closed pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
URL: https://github.com/apache/spark/pull/40401
[GitHub] [spark] zhengruifeng commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
zhengruifeng commented on PR #40401: URL: https://github.com/apache/spark/pull/40401#issuecomment-1467252638

thank you @allanf-db, merged into master/branch-3.4
[GitHub] [spark] atronchi commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2
atronchi commented on PR #18990: URL: https://github.com/apache/spark/pull/18990#issuecomment-1467104804

This issue appears to remain at large for the DataFrame API, which is used more broadly than the RDD API. What would it take to extend the fix to the DataFrame API? I verified this on Spark 3.2 using `df.repartition(1024)` on a dataframe with ~200k rows, which resulted in almost 30% EMPTY partitions and the skew of the remaining non-empty ones shown below (screenshot: https://user-images.githubusercontent.com/4906224/224852367-a33564c8-21f2-4daa-8da2-cbca0389594a.png).
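The skew described above is straightforward to measure; a minimal PySpark sketch (illustrative, not code from the thread) that counts rows per partition after a repartition:

```python
# Illustrative sketch: count rows per partition to reproduce the
# empty-partition / skew measurement described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(200_000).repartition(1024)

sizes = df.rdd.glom().map(len).collect()  # row count of each partition
empty = sum(1 for s in sizes if s == 0)
print(f"empty partitions: {empty}/{len(sizes)}")
print(f"max/min non-empty partition size: {max(sizes)}/{min(s for s in sizes if s > 0)}")
```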
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134750091

## connector/connect/common/src/main/protobuf/spark/connect/base.proto

```diff
@@ -272,6 +272,9 @@ message ExecutePlanResponse {
   // The metrics observed during the execution of the query plan.
   repeated ObservedMetrics observed_metrics = 6;
 
+  // The Spark schema
+  DataType schema = 7;
```

Review Comment: @grundprinzip The actual Spark data type is necessary to rebuild the UDT objects.
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134750091

[diff context as in the previous comment on connector/connect/common/src/main/protobuf/spark/connect/base.proto]

Review Comment: @grundprinzip The actual Spark data type in the execution result is necessary to rebuild the UDT objects.
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134776526

[diff context as in the earlier comment on connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala]

Review Comment: Similar question for Model / Transformer / Predictor / Pipeline
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134777302

## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala

```diff
@@ -0,0 +1,41 @@
+/* ... ASF license header ... */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.Vector
+
+/**
+ * Class that represents the features and label of a data point.
+ *
+ * @param label
+ *   Label for this data point.
+ * @param features
+ *   List of features for this data point.
+ */
+@Since("3.5.0")
+case class LabeledPoint(label: Double, features: Vector) {
```

Review Comment: can we also move this to mllib-common ?
[GitHub] [spark] WeichenXu123 commented on pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on PR #40097: URL: https://github.com/apache/spark/pull/40097#issuecomment-1467213725

For the 2 exceptions:

> 2, org.apache.spark.ml.linalg.* except VectorUDTSuite due to cyclical dependency; (it copies the VectorUDTSuite except test("JavaTypeInference with VectorUDT"))
> 3, org.apache.spark.ml.param.* except ParamsSuite due to cyclical dependency; (it copies the ParamsSuite except test("Filtering ParamMap"))

You can move "JavaTypeInference with VectorUDT" out of VectorUDTSuite, and move "Filtering ParamMap" out of ParamsSuite, and then move the remaining test code without duplication.
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134793723

[diff context as in the earlier comments on connector/connect/common/src/main/protobuf/spark/connect/base.proto]

Review Comment: Yes, it's only for `df.collect` for now.
[GitHub] [spark] cloud-fan commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
cloud-fan commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1467245909

Yea, AQE may remove materialized query stages due to optimizations like empty relation propagation, but I think it's fine, as the shuffle files are still there (we don't unregister the shuffle), so the reused shuffle operator can still read these shuffle files using the shuffle id.

The problem with EXPLAIN is that it only looks for the referenced exchange in the query plan tree. I think we can also look it up from the AQE stage cache map?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
zhengruifeng commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134827687

## python/pyspark/sql/connect/dataframe.py

```diff
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
         if self._session is None:
             raise Exception("Cannot collect on empty session.")
         query = self._plan.to_proto(self._session.client)
-        table = self._session.client.to_table(query)
+        table, schema = self._session.client.to_table(query)
 
-        schema = from_arrow_schema(table.schema)
+        schema = schema or from_arrow_schema(table.schema)
```

Review Comment: It seems that MapType's `valueContainsNull` is stored in `to_arrow_type` but discarded in `from_arrow_type`.
[GitHub] [spark] zhengruifeng commented on pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on PR #40097: URL: https://github.com/apache/spark/pull/40097#issuecomment-1467297939

> For the 2 exceptions:
>
> > 2, org.apache.spark.ml.linalg.* except VectorUDTSuite due to cyclical dependency; (it copies the VectorUDTSuite except test("JavaTypeInference with VectorUDT"))
> > 3, org.apache.spark.ml.param.* except ParamsSuite due to cyclical dependency; (it copies the ParamsSuite except test("Filtering ParamMap"))
>
> You can move "JavaTypeInference with VectorUDT" out of VectorUDTSuite, and move "Filtering ParamMap" out of ParamsSuite, and then can move remaining test code without duplication.

good point, done
[GitHub] [spark] gatorsmile commented on pull request #40216: [SPARK-42593][PS] Deprecate & remove the APIs that will be removed in pandas 2.0.
gatorsmile commented on PR #40216: URL: https://github.com/apache/spark/pull/40216#issuecomment-1467332483

Let us mention all the breaking changes and deprecations in both the release notes and migration guides.
[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
panbingkun commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134790402

[diff context as in the earlier comments on sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala]

Review Comment: OK
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134799615

[diff context as in the earlier comments on connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala]

Review Comment: The reason that I didn't add `Transformer` (as well as its subclasses) into `mllib-common` is to control the dependency:

```scala
def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
```

`Transformer` depends on `Dataset`, so I think it will not be involved in the catalyst common module. cc @hvanhovell
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134799357

[diff context as in the earlier comment on python/pyspark/sql/connect/dataframe.py]

Review Comment: Btw, for the collections, we can retrieve nullability from the schema. For example:

```python
if pa.types.is_list(at):
    field = at.value_field
    ArrayType(from_arrow_type(field.type), containsNull=field.nullable)
```
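A slightly fuller sketch of that idea, covering the MapType case raised earlier in the thread (assumptions: the pyarrow field accessors named below; this is not the actual `pyspark.sql.pandas.types.from_arrow_type` implementation):

```python
import pyarrow as pa
from pyspark.sql.types import ArrayType, MapType
from pyspark.sql.pandas.types import from_arrow_type  # existing converter, used for element types

def collection_type_with_nullability(at: pa.DataType):
    # Sketch only: derive containsNull / valueContainsNull from the Arrow
    # field's nullability instead of discarding it, as suggested above.
    if pa.types.is_list(at):
        field = at.value_field
        return ArrayType(from_arrow_type(field.type), containsNull=field.nullable)
    if pa.types.is_map(at):
        return MapType(
            from_arrow_type(at.key_type),
            from_arrow_type(at.item_type),
            valueContainsNull=at.item_field.nullable,
        )
    return from_arrow_type(at)
```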
[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes
StevenChenDatabricks commented on PR #40385: URL: https://github.com/apache/spark/pull/40385#issuecomment-1467287615

@cloud-fan Yes, this is purely a UI and EXPLAIN issue; it does not affect query execution. I'm not sure how the AQE context stageCache map would help. The issue in EXPLAIN is that `ReusedExchange.child` references an Exchange node that is not referenced anywhere else in the plan tree, so we need to generate IDs for the subtree rooted at `ReusedExchange.child` and print them out. To do this, we need a way to check whether `ReusedExchange.child` is referenced anywhere else; if it is not, we need to recursively generate IDs for the subtree. I keep a HashSet of nodes with IDs already generated and check `ReusedExchange.child` against it to see if we need to recursively generate IDs on the subtree.
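A generic sketch of that ID-generation idea (in Python for illustration; not Spark's Scala implementation, and the `Node`/`reuses` model below is hypothetical): number the main tree first, then give a reused node's otherwise-unreferenced child subtree its own IDs exactly once.

```python
from itertools import count

class Node:
    """Toy plan node; `reuses` stands in for ReusedExchange.child."""
    def __init__(self, name, children=(), reuses=None):
        self.name, self.children, self.reuses = name, list(children), reuses

def number_subtree(node, ids, counter):
    if id(node) not in ids:  # the "HashSet of nodes with IDs already generated"
        ids[id(node)] = next(counter)
        for child in node.children:
            number_subtree(child, ids, counter)

def generate_ids(root):
    ids, counter = {}, count(1)
    number_subtree(root, ids, counter)
    # Second pass: recurse into reused subtrees that received no ID above.
    def visit(node):
        if node.reuses is not None and id(node.reuses) not in ids:
            number_subtree(node.reuses, ids, counter)
            visit(node.reuses)
        for child in node.children:
            visit(child)
    visit(root)
    return ids
```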
[GitHub] [spark] github-actions[bot] closed pull request #38887: [SPARK-41368][SQL] Reorder the window partition expressions by expression stats
github-actions[bot] closed pull request #38887: [SPARK-41368][SQL] Reorder the window partition expressions by expression stats
URL: https://github.com/apache/spark/pull/38887
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134775534

[diff context as in the earlier comments on connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala]

Review Comment: Q: Why not move this to mllib-common ?
[GitHub] [spark] xinrong-meng opened a new pull request, #40405: [WIP][SPARK-42340][CONNECT][PYTHON] Implement `GroupedData.applyInPandas`
xinrong-meng opened a new pull request, #40405: URL: https://github.com/apache/spark/pull/40405

- [ ] Parity tests

### What changes were proposed in this pull request?
Implement `GroupedData.applyInPandas`.

### Why are the changes needed?
Parity with vanilla PySpark.

### Does this PR introduce _any_ user-facing change?
Yes. `GroupedData.applyInPandas` is supported now.

### How was this patch tested?
Unit tests.
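For context, `GroupedData.applyInPandas` in vanilla PySpark, the behavior this PR brings to Spark Connect, looks like the following (a standard usage example of the existing API, not code from the PR):

```python
# Standard applyInPandas usage: the grouped rows arrive as a pandas DataFrame,
# and the returned DataFrame must match the declared output schema.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives all rows of one group.
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```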
[GitHub] [spark] ulysses-you opened a new pull request, #40406: [SPARK-42101][SQL][FOLLOWUP] Improve TableCacheQueryStage with CoalesceShufflePartitions
ulysses-you opened a new pull request, #40406: URL: https://github.com/apache/spark/pull/40406

### What changes were proposed in this pull request?
`CoalesceShufflePartitions` should make sure all leaves are `ExchangeQueryStageExec`, to avoid collecting `TableCacheQueryStage`, since we cannot change the partition number of an IMR (InMemoryRelation). Add two tests to make sure `CoalesceShufflePartitions` works well with `TableCacheQueryStage`. Note that these two tests pass even without this PR, because `ValidateRequirements` reverts the wrong plan.

### Why are the changes needed?
Avoid a potential issue.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
add test
[GitHub] [spark] pan3793 commented on pull request #39160: [SPARK-41667][K8S] Expose env var SPARK_DRIVER_POD_NAME in Driver Pod
pan3793 commented on PR #39160: URL: https://github.com/apache/spark/pull/39160#issuecomment-1467341858

I found that [apple/batch-processing-gateway](https://github.com/apple/batch-processing-gateway) uses Pod Name to fetch the log as well: https://github.com/apple/batch-processing-gateway/blob/955d6a6f6dd0adf8e886cad0e5ff926e87697101/src/main/java/com/apple/spark/rest/ApplicationGetLogRest.java#L327
[GitHub] [spark] LuciferYang commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
LuciferYang commented on code in PR #40395: URL: https://github.com/apache/spark/pull/40395#discussion_r1134966350

[diff context as in the earlier LuciferYang comment on connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala]

Review Comment: @dongjoon-hyun like this? Let me double check the new change on Linux & Java 17.
[GitHub] [spark] itholic commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
itholic commented on PR #40401: URL: https://github.com/apache/spark/pull/40401#issuecomment-1467125916 Looks good otherwise -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
itholic commented on PR #40401: URL: https://github.com/apache/spark/pull/40401#issuecomment-1467125777 Seems like there are some more unchanged docstrings in several files, as below:
```
spark % git grep "Support Spark Connect"
python/pyspark/sql/column.py:Support Spark Connect.
python/pyspark/sql/column.py:Support Spark Connect.
python/pyspark/sql/dataframe.py:Support Spark Connect.
python/pyspark/sql/dataframe.py:Support Spark Connect.
python/pyspark/sql/udf.py:Support Spark Connect.
python/pyspark/sql/udf.py:Support Spark Connect.
```
Btw, you can change all the strings at once by running a shell command in the terminal, such as:
```shell
find . -name "*.py" -exec perl -pi -e 's/Support Spark Connect/Supports Spark Connect/g' {} \;
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum closed pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type
wangyum closed pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type URL: https://github.com/apache/spark/pull/40190 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type
wangyum commented on PR #40190: URL: https://github.com/apache/spark/pull/40190#issuecomment-1467139918 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134749726 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## Review Comment: @grundprinzip The actual Spark data type is necessary to rebuild the UDT objects. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134796113 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]: if self._session is None: raise Exception("Cannot collect on empty session.") query = self._plan.to_proto(self._session.client) -table = self._session.client.to_table(query) +table, schema = self._session.client.to_table(query) -schema = from_arrow_schema(table.schema) +schema = schema or from_arrow_schema(table.schema) Review Comment: That's interesting, and I was thinking about a similar thing, but I didn't take that approach because it takes up space in the `RecordBatch`, which can be sent to the client repeatedly. That means the schema overhead could be huge if it is repeated many times. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40142: [SPARK-41171][SQL] Infer and push down window limit through window if partitionSpec is empty
cloud-fan commented on code in PR #40142: URL: https://github.com/apache/spark/pull/40142#discussion_r1134795703 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala: ## @@ -130,6 +130,7 @@ abstract class Optimizer(catalogManager: CatalogManager) val operatorOptimizationBatch: Seq[Batch] = { Batch("Operator Optimization before Inferring Filters", fixedPoint, operatorOptimizationRuleSet: _*) :: + Batch("Infer window group limit", Once, InferWindowGroupLimit) :: Review Comment: I think this is an unrelated change and is actually a regression. We may infer filters and it's better to run window limit optimization as late as possible. @beliefer can we revert this change about moving the rule? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
ueshin commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134799357 ## python/pyspark/sql/connect/dataframe.py: ## @@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]: if self._session is None: raise Exception("Cannot collect on empty session.") query = self._plan.to_proto(self._session.client) -table = self._session.client.to_table(query) +table, schema = self._session.client.to_table(query) -schema = from_arrow_schema(table.schema) +schema = schema or from_arrow_schema(table.schema) Review Comment: Btw, for the collections, we can retrieve nullability from the schema. For example:
```py
if pa.types.is_list(at):
    field = at.value_field
    ArrayType(from_arrow_type(field.type), containsNull=field.nullable)
```
I guess map type also works?
```py
if pa.types.is_map(at):
    valueContainsNull = at.item_field.nullable
    ...
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134881273 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala: ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.linalg.Vector + +/** + * Class that represents the features and label of a data point. + * + * @param label + * Label for this data point. + * @param features + * List of features for this data point. + */ +@Since("3.5.0") +case class LabeledPoint(label: Double, features: Vector) { Review Comment: done, but we can not change the since version then -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on pull request #40404: [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats
gengliangwang commented on PR #40404: URL: https://github.com/apache/spark/pull/40404#issuecomment-1467317784 merging to master/3.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang closed pull request #40404: [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats
gengliangwang closed pull request #40404: [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats URL: https://github.com/apache/spark/pull/40404 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gatorsmile commented on pull request #40336: [SPARK-42706][SQL][DOCS] Document the Spark SQL error classes in user-facing documentation.
gatorsmile commented on PR #40336: URL: https://github.com/apache/spark/pull/40336#issuecomment-1467329403 @MaxGekk should we merge it to 3.4? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134945494 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import scala.annotation.varargs + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.sql.Dataset + +/** + * Abstract class for estimators that fit models to data. + */ +abstract class Estimator[M <: Model[M]] extends PipelineStage { Review Comment: Can we change it to `def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame` ? Shall we still support Dataset[_] input ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134961956 ## mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala: ## @@ -1,65 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.ml.attribute - -import org.apache.spark.SparkFunSuite - -class AttributeGroupSuite extends SparkFunSuite { - - test("attribute group") { -val attrs = Array( - NumericAttribute.defaultAttr, - NominalAttribute.defaultAttr, - BinaryAttribute.defaultAttr.withIndex(0), - NumericAttribute.defaultAttr.withName("age").withSparsity(0.8), - NominalAttribute.defaultAttr.withName("size").withValues("small", "medium", "large"), - BinaryAttribute.defaultAttr.withName("clicked").withValues("no", "yes"), - NumericAttribute.defaultAttr, - NumericAttribute.defaultAttr) -val group = new AttributeGroup("user", attrs) -assert(group.size === 8) -assert(group.name === "user") -assert(group(0) === NumericAttribute.defaultAttr.withIndex(0)) -assert(group(2) === BinaryAttribute.defaultAttr.withIndex(2)) -assert(group.indexOf("age") === 3) -assert(group.indexOf("size") === 4) -assert(group.indexOf("clicked") === 5) -assert(!group.hasAttr("abc")) -intercept[NoSuchElementException] { - group("abc") -} -assert(group === AttributeGroup.fromMetadata(group.toMetadataImpl, group.name)) -assert(group === AttributeGroup.fromStructField(group.toStructField())) - } - - test("attribute group without attributes") { -val group0 = new AttributeGroup("user", 10) -assert(group0.name === "user") -assert(group0.numAttributes === Some(10)) -assert(group0.size === 10) -assert(group0.attributes.isEmpty) -assert(group0 === AttributeGroup.fromMetadata(group0.toMetadataImpl, group0.name)) -assert(group0 === AttributeGroup.fromStructField(group0.toStructField())) - -val group1 = new AttributeGroup("item") -assert(group1.name === "item") -assert(group1.numAttributes.isEmpty) -assert(group1.attributes.isEmpty) -assert(group1.size === -1) - } -} Review Comment: Got it, I clicked into the diff for single commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2
srowen commented on PR #18990: URL: https://github.com/apache/spark/pull/18990#issuecomment-1467131067 @atronchi what is "df" here? I couldn't reproduce that with a DF of 200K simple rows -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
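One hedged way to look for such skew, assuming a DF of 200K simple rows as in the comment above (the original report's `df` is unknown, and all names and sizes here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Count rows per partition after repartitioning to a power-of-2 partition
// count; heavy skew would show up as a large spread between min and max.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sizes = spark.range(200000).toDF("id")
  .repartition(512)
  .rdd
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
println(s"min=${sizes.map(_._2).min} max=${sizes.map(_._2).max}")
```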
[GitHub] [spark] linhongliu-db commented on pull request #40403: [SPARK-42754][SQL][UI] Fix backward compatibility issue in nested SQL execution
linhongliu-db commented on PR #40403: URL: https://github.com/apache/spark/pull/40403#issuecomment-1467162461 cc @JoshRosen @rednaxelafx @xinrong-meng -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect
zhengruifeng commented on PR #40401: URL: https://github.com/apache/spark/pull/40401#issuecomment-1467188490 LGTM pending CI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134786529 ## mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.param.shared + +import scala.concurrent.ExecutionContext + +import org.apache.spark.util.ThreadUtils + +private[ml] trait HasExecutionContext extends HasParallelism { + + /** + * Create a new execution context with a thread-pool that has a maximum number of threads + * set to the value of [[parallelism]]. If this param is set to 1, a same-thread executor + * will be used to run in serial. + */ + private[ml] def getExecutionContext: ExecutionContext = { +getParallelism match { + case 1 => +ThreadUtils.sameThread + case n => +ExecutionContext.fromExecutorService(ThreadUtils + .newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n)) +} + } Review Comment: I'd prefer to convert this to a helper function and move it to `mllib-common`, so we don't need to add a new `HasExecutionContext` trait. `HasParallelism` makes it clear that it is a parallelism param, but `HasExecutionContext` is odd: it inherits `HasParallelism`, which makes the code confusing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
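A sketch of the helper-function shape proposed in the comment above, assuming `ThreadUtils` stays reachable from the target module (the object and method names are hypothetical):

```scala
import scala.concurrent.ExecutionContext

import org.apache.spark.util.ThreadUtils

// Hypothetical standalone helper replacing the HasExecutionContext trait:
// the caller passes in the value of its own HasParallelism param.
private[ml] object MLExecutionContexts {
  def forParallelism(parallelism: Int, poolName: String): ExecutionContext =
    parallelism match {
      case 1 => ThreadUtils.sameThread
      case n =>
        ExecutionContext.fromExecutorService(
          ThreadUtils.newDaemonCachedThreadPool(s"$poolName-thread-pool", n))
    }
}
```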
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect
zhengruifeng commented on code in PR #40402: URL: https://github.com/apache/spark/pull/40402#discussion_r1134776117 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -272,6 +272,9 @@ message ExecutePlanResponse { // The metrics observed during the execution of the query plan. repeated ObservedMetrics observed_metrics = 6; + // The Spark schema DataType schema = 7; Review Comment: Is it an optional field, only available in `df.collect`? ## python/pyspark/sql/connect/dataframe.py: ## @@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]: if self._session is None: raise Exception("Cannot collect on empty session.") query = self._plan.to_proto(self._session.client) -table = self._session.client.to_table(query) +table, schema = self._session.client.to_table(query) -schema = from_arrow_schema(table.schema) +schema = schema or from_arrow_schema(table.schema) Review Comment: When I was implementing the collection, I was thinking about this dumb question: is it possible to make the Arrow schema 100% compatible with the Spark schema if we store the differing fields in Arrow metadata? E.g., for a Spark `MapType`, store `valueContainsNull` as metadata in the `pa.schema`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
cloud-fan commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134828589 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF( } } +class HiveGenericUDFHelper( +funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) + extends HiveInspectors + with Serializable + with Logging { + + @transient + private[hive] val deterministic = isUDFDeterministic && children.forall(_.deterministic) + + @transient + private[hive] val foldable = Review Comment: SGTM, we can probably add a base class `HiveUDFEvaluatorBase` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
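A rough sketch of what such a shared base class could look like; the shape and names below are assumptions for illustration, not the merged code:

```scala
package org.apache.spark.sql.hive

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical evaluator base shared by the HiveSimpleUDF and HiveGenericUDF
// helpers: it owns the wrapped Hive function plus the common properties both
// evaluators compute from their return inspectors.
abstract class HiveUDFEvaluatorBase[UDFType <: AnyRef](
    funcWrapper: HiveFunctionWrapper,
    children: Seq[Expression])
  extends HiveInspectors with Serializable {

  @transient
  lazy val function: UDFType = funcWrapper.createFunction[UDFType]()

  def deterministic: Boolean
  def foldable: Boolean
  def dataType: DataType
}
```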
[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
LuciferYang commented on PR #40395: URL: https://github.com/apache/spark/pull/40395#issuecomment-1467321228 friendly ping @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Stove-hust commented on pull request #40393: []SPARK-40082]
Stove-hust commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1467340408 > @Stove-hust Thank you for reporting and the patch. Would you be able to share driver logs? Sure (comments added inline):
```
# stage 10 failed
22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3 partition 435
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 3 partition 435

# resubmit stage 10 && parent stage 9
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 (processCmd at CliDriver.java:386) due to fetch failure
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting failed stages
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), which has no missing parents
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since it is already shuffle merge finalized
22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding task set 9.1 with 3 tasks

# The tasks of the first stage 10 attempt complete one after another, and notifyDriverAboutPushCompletion is called to end stage 10 and mark finalizeTask; because the stage is not in runningStages, it cannot be marked shuffleMergeFinalized.
22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt (executor 117) (494/500)
22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled (another attempt succeeded)
22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled (another attempt succeeded)
22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in stage 10.0 (TID 6731) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).

# Removed TaskSet 10.0, whose tasks have all completed
22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring task-finished event for 435.0 in stage 10.0 because task 435 has already completed successfully
22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool

# notifyDriverAboutPushCompletion stage 10
22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing shuffle merge in 0 s
22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle merge with registering merge results set to true

# stage 9 finished
22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 74) (3/3)
22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed TaskSet 9.1, whose tasks have all completed, from pool
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s

# resubmit stage 10
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for newly runnable stages
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: Set(ShuffleMapStage 11, ShuffleMapStage 8)
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: Set(ShuffleMapStage 12, ShuffleMapStage 10)
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), which has no missing parents
22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator:
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
cloud-fan commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134994989 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -129,58 +129,25 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType: DataTyp private[hive] case class HiveGenericUDF( name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends Expression - with HiveInspectors - with Logging with UserDefinedExpression { override def nullable: Boolean = true - override lazy val deterministic: Boolean = isUDFDeterministic && children.forall(_.deterministic) - - override def foldable: Boolean = -isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector] - - @transient - lazy val function = funcWrapper.createFunction[GenericUDF]() + override lazy val deterministic: Boolean = evaluator.deterministic - @transient - private lazy val argumentInspectors = children.map(toInspector) + override def foldable: Boolean = evaluator.foldable - @transient - private lazy val returnInspector = { -function.initializeAndFoldConstants(argumentInspectors.toArray) - } + override lazy val dataType: DataType = evaluator.dataType Review Comment: ```suggestion override def dataType: DataType = evaluator.dataType ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF
cloud-fan commented on code in PR #40394: URL: https://github.com/apache/spark/pull/40394#discussion_r1134995335 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala: ## @@ -129,58 +129,25 @@ private[hive] class DeferredObjectAdapter(oi: ObjectInspector, dataType: DataTyp private[hive] case class HiveGenericUDF( name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) extends Expression - with HiveInspectors - with Logging with UserDefinedExpression { override def nullable: Boolean = true - override lazy val deterministic: Boolean = isUDFDeterministic && children.forall(_.deterministic) - - override def foldable: Boolean = -isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector] - - @transient - lazy val function = funcWrapper.createFunction[GenericUDF]() + override lazy val deterministic: Boolean = evaluator.deterministic - @transient - private lazy val argumentInspectors = children.map(toInspector) + override def foldable: Boolean = evaluator.foldable - @transient - private lazy val returnInspector = { -function.initializeAndFoldConstants(argumentInspectors.toArray) - } + override lazy val dataType: DataType = evaluator.dataType - // Visible for codegen @transient - lazy val unwrapper: Any => Any = unwrapperFor(returnInspector) - - @transient - private lazy val isUDFDeterministic = { -val udfType = function.getClass.getAnnotation(classOf[HiveUDFType]) -udfType != null && udfType.deterministic() && !udfType.stateful() - } - - // Visible for codegen - @transient - lazy val deferredObjects: Array[DeferredObject] = argumentInspectors.zip(children).map { -case (inspect, child) => new DeferredObjectAdapter(inspect, child.dataType) - }.toArray[DeferredObject] - - override lazy val dataType: DataType = inspectorToDataType(returnInspector) + private[hive] lazy val evaluator = new HiveGenericUDFEvaluator(funcWrapper, children) Review Comment: ```suggestion private lazy val evaluator = new HiveGenericUDFEvaluator(funcWrapper, children) ``` unless we need to access it somewhere. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
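The suggestions above replace `lazy val`s with plain `def`s where the member only delegates to a value the evaluator already caches; a minimal standalone illustration (not Spark code) of why that is safe:

```scala
// The evaluator computes and caches the expensive value exactly once.
class Evaluator {
  lazy val dataType: String = { println("computed once"); "int" }
}

// The wrapping class exposes it with a def: no second cached field, and the
// underlying lazy val still guarantees single evaluation.
class Expr(evaluator: Evaluator) {
  def dataType: String = evaluator.dataType
}

val e = new Expr(new Evaluator)
e.dataType; e.dataType // "computed once" prints a single time
```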
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
zhengruifeng commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134995656 ## mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala: ## @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.param.shared + +import scala.concurrent.ExecutionContext + +import org.apache.spark.util.ThreadUtils + +private[ml] trait HasExecutionContext extends HasParallelism { + + /** + * Create a new execution context with a thread-pool that has a maximum number of threads + * set to the value of [[parallelism]]. If this param is set to 1, a same-thread executor + * will be used to run in serial. + */ + private[ml] def getExecutionContext: ExecutionContext = { +getParallelism match { + case 1 => +ThreadUtils.sameThread + case n => +ExecutionContext.fromExecutorService(ThreadUtils + .newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n)) +} + } Review Comment: > Can we also move ThreadUtils to common ? I think it is not suitable to be in `mllib-common` ; it should be in another common module extracted from spark-core. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`
ueshin commented on PR #40388: URL: https://github.com/apache/spark/pull/40388#issuecomment-1467120067 I guess we can just go with the comment:
```py
# The implementation of pandas_udf is embedded in pyspark.sql.functions.pandas_udf
# for code reuse.
from pyspark.sql.functions import pandas_udf as pandas_udf
```
for those who want to import `pyspark.sql.connect.functions`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134787824 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala: ## @@ -0,0 +1,97 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import scala.annotation.varargs + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.sql.Dataset + +/** + * Abstract class for estimators that fit models to data. + */ +abstract class Estimator[M <: Model[M]] extends PipelineStage { Review Comment: For meta algorithm, I think we can move all of them to mllib-common, but we can do it in following PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gerashegalov commented on a diff in pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization
gerashegalov commented on code in PR #40372: URL: https://github.com/apache/spark/pull/40372#discussion_r1134801751 ## python/pyspark/errors/exceptions/captured.py: ## @@ -65,8 +65,15 @@ def __str__(self) -> str: assert SparkContext._jvm is not None jvm = SparkContext._jvm -sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get() -debug_enabled = sql_conf.pysparkJVMStacktraceEnabled() + +# SPARK-42752: default to True to see issues with initialization +debug_enabled = True +try: +sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get() +debug_enabled = sql_conf.pysparkJVMStacktraceEnabled() +except BaseException: Review Comment: I advocate for keeping the likelihood of an unhelpful unprintable exception during initialization to the minimum. I would not want to revisit the issue for other runtime exceptions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`
xinrong-meng commented on PR #40388: URL: https://github.com/apache/spark/pull/40388#issuecomment-1467281968 We didn't wrap `pyspark.sql.functions.pandas_udf` with `try_remote_functions`, so `"PYSPARK_NO_NAMESPACE_SHARE"` should be irrelevant. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null
wangyum commented on PR #40360: URL: https://github.com/apache/spark/pull/40360#issuecomment-1467294557 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] otterc commented on pull request #40393: []SPARK-40082]
otterc commented on PR #40393: URL: https://github.com/apache/spark/pull/40393#issuecomment-1467314593 @Stove-hust Thank you for reporting and the patch. Would you be able to share driver logs? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`
ueshin commented on PR #40388: URL: https://github.com/apache/spark/pull/40388#issuecomment-1467349620 If it's irrelevant, that means it's an issue, no? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`
WeichenXu123 commented on code in PR #40097: URL: https://github.com/apache/spark/pull/40097#discussion_r1134946826

## mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala:
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+    getParallelism match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(ThreadUtils
+          .newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+    }
+  }

Review Comment: Can we also move `ThreadUtils` to common?
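For context on why this helper exists: ML meta-algorithms (e.g. cross-validation) fit sub-models on this execution context. A minimal sketch of such a caller, with illustrative names that are not Spark's actual code:

```
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelFit {
  // Fit each task on the given ExecutionContext; with parallelism = 1 the
  // trait above returns a same-thread executor, so this degrades to serial.
  def fitInParallel[M](tasks: Seq[() => M])(implicit ec: ExecutionContext): Seq[M] = {
    val futures = tasks.map(task => Future(task()))
    futures.map(f => Await.result(f, Duration.Inf))
  }
}
```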
[GitHub] [spark] panbingkun commented on a diff in pull request #39949: [SPARK-42386][SQL] Rewrite HiveGenericUDF with Invoke
panbingkun commented on code in PR #39949: URL: https://github.com/apache/spark/pull/39949#discussion_r1133604830

## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
@@ -194,47 +183,52 @@ private[hive] case class HiveGenericUDF(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     copy(children = newChildren)

-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    val refTerm = ctx.addReferenceObj("this", this)
-    val childrenEvals = children.map(_.genCode(ctx))
-
-    val setDeferredObjects = childrenEvals.zipWithIndex.map {

Review Comment: I have submitted a new PR, https://github.com/apache/spark/pull/40394, to refactor `HiveGenericUDF`.
[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML
WeichenXu123 commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1133666221

## mllib/src/main/scala/org/apache/spark/ml/param/params.scala:
@@ -44,8 +45,14 @@ import org.apache.spark.ml.util.Identifiable
  *                See [[ParamValidators]] for factory methods for common validation functions.
  * @tparam T param value type
  */
-class Param[T](val parent: String, val name: String, val doc: String, val isValid: T => Boolean)
-  extends Serializable {
+class Param[T: ClassTag](

Review Comment: @zhengruifeng I recall I tried the `(implicit paramValueClassTag: ClassTag[T])` approach before, but it makes it hard for us to get the ClassTag object. So I prefer the current approach.
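For reference, the two shapes under discussion differ only in how the `ClassTag` is threaded through. A minimal sketch (class names are illustrative):

```
import scala.reflect.{classTag, ClassTag}

// Context bound: desugars to an implicit constructor parameter that is
// retrieved via classTag[T] (or implicitly[ClassTag[T]]) inside the class.
class ParamA[T: ClassTag](val name: String) {
  val valueClass: Class[_] = classTag[T].runtimeClass
}

// Explicit implicit parameter list: the ClassTag becomes a named member
// that can be stored and passed around directly.
class ParamB[T](val name: String)(implicit val valueClassTag: ClassTag[T]) {
  val valueClass: Class[_] = valueClassTag.runtimeClass
}
```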
[GitHub] [spark] shrprasa commented on a diff in pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates
shrprasa commented on code in PR #40128: URL: https://github.com/apache/spark/pull/40128#discussion_r1133671423

## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala:
@@ -143,6 +144,9 @@ private[spark] class Client(
         logError("Please check \"kubectl auth can-i create [resource]\" first." +
           " It should be yes. And please also check your feature step implementation.")
         kubernetesClient.resourceList(preKubernetesResources: _*).delete()
+        // register shutdownhook for cleaning up the upload dir only

Review Comment: Client mode will not help, because cleanup would then be missed if job submission fails in cluster deploy mode. We already do some other resource-cleanup work in the catch blocks of this class, so registering the upload-path cleanup here should also be fine.
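As a sketch of the mechanism under discussion: a cleanup action that runs even when submission aborts can be registered as a JVM shutdown hook. Spark has its own `ShutdownHookManager` utility for this; the helper below is a hypothetical plain-JVM equivalent:

```
object UploadDirCleanup {
  // Hypothetical helper: best-effort removal of the upload dir on JVM exit,
  // so the directory is cleaned up even if submission fails mid-way.
  def register(deleteUploadDir: () => Unit): Unit =
    Runtime.getRuntime.addShutdownHook(new Thread(() => deleteUploadDir()))
}
```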
[GitHub] [spark] beliefer closed pull request #39990: [SPARK-42415][SQL] The built-in dialects support OFFSET and paging query.
beliefer closed pull request #39990: [SPARK-42415][SQL] The built-in dialects support OFFSET and paging query. URL: https://github.com/apache/spark/pull/39990
[GitHub] [spark] beliefer commented on pull request #39990: [SPARK-42415][SQL] The built-in dialects support OFFSET and paging query.
beliefer commented on PR #39990: URL: https://github.com/apache/spark/pull/39990#issuecomment-1465922089 https://github.com/apache/spark/pull/40396 replaces this one.
[GitHub] [spark] beliefer commented on pull request #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true
beliefer commented on PR #40396: URL: https://github.com/apache/spark/pull/40396#issuecomment-1465922700 ping @huaxingao cc @cloud-fan @sadikovi
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML
zhengruifeng commented on code in PR #40297: URL: https://github.com/apache/spark/pull/40297#discussion_r1133771185

## connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message MlEvaluator {
+  string name = 1;
+  MlParams params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+    Fit fit = 1;
+    FetchModelAttr fetch_model_attr = 2;
+    FetchModelSummaryAttr fetch_model_summary_attr = 3;
+    LoadModel load_model = 4;
+    SaveModel save_model = 5;
+    Evaluate evaluate = 6;
+    SaveStage save_stage = 7;
+    LoadStage load_stage = 8;
+    SaveEvaluator save_evaluator = 9;
+    LoadEvaluator load_evaluator = 10;
+    CopyModel copy_model = 11;
+  }
+
+  message Fit {
+    MlStage estimator = 1;
+    Relation dataset = 2;
+  }
+
+  message Evaluate {
+    MlEvaluator evaluator = 1;
+  }
+
+  message LoadModel {
+    string name = 1;
+    string path = 2;
+  }
+
+  message SaveModel {
+    int64 model_ref_id = 1;
+    string path = 2; // saving path
+    bool overwrite = 3;
+    map<string, string> options = 4; // saving options
+  }
+
+  message LoadStage {
+    string name = 1;
+    string path = 2;
+    MlStage.StageType type = 3;
+  }
+
+  message SaveStage {
+    MlStage stage = 1;
+    string path = 2; // saving path
+    bool overwrite = 3;
+    map<string, string> options = 4; // saving options
+  }
+
+  message LoadEvaluator {
+    string name = 1;
+    string path = 2;
+  }
+
+  message SaveEvaluator {
+    MlEvaluator evaluator = 1;
+    string path = 2; // saving path
+    bool overwrite = 3;
+    map<string, string> options = 4; // saving options
+  }
+
+  message FetchModelAttr {
+    int64 model_ref_id = 1;
+    string name = 2;
+  }
+
+  message FetchModelSummaryAttr {
+    int64 model_ref_id = 1;
+    string name = 2;
+    MlParams params = 3;
+
+    // Evaluation dataset that it uses to compute
+    // the summary attribute
+    // If not set, get attributes from
+    // model.summary (i.e. the summary on the training dataset)
+    optional Relation evaluation_dataset = 4;
+  }
+
+  message CopyModel {
+    int64 model_ref_id = 1;
+  }
+}
+
+
+message MlCommandResponse {
+  oneof ml_command_response_type {
+    Expression.Literal literal = 1;
+    ModelInfo model_info = 2;
+    Vector vector = 3;

Review Comment: looks fine.
[GitHub] [spark] pan3793 commented on pull request #39160: [SPARK-41667][K8S] Expose env var SPARK_DRIVER_POD_NAME in Driver Pod
pan3793 commented on PR #39160: URL: https://github.com/apache/spark/pull/39160#issuecomment-1465973424

Cross-referencing comments from https://github.com/apache/spark/pull/40392#issuecomment-1465870752

> Your PR tried to add `SPARK_DRIVER_POD_NAME` to Driver Pod to expose it to 3rd party pods.
> ... to use Driver Service instead

Is the service name a kind of official API that allows 3rd-party components to access the Spark driver on K8s? If yes, what about executors? My vision is to expose both the driver and the executors to the log service in a unified way, and aggregating logs by Pod is much more straightforward, just as YARN does by container. So my first candidate is the Pod name, and the second is the Pod IP.

@dongjoon-hyun I do understand that we should be careful about adding each ENV variable, configuration, etc. If you think the Pod IP is acceptable, then it is sufficient after SPARK-42769.
[GitHub] [spark] jdferreira opened a new pull request, #40398: Update `translate` docblock
jdferreira opened a new pull request, #40398: URL: https://github.com/apache/spark/pull/40398

### What changes were proposed in this pull request?
The documentation for the `translate` SQL function is a bit difficult to parse and understand. I propose new wording.

### Why are the changes needed?
To improve the documentation.

### Does this PR introduce _any_ user-facing change?
I'm not sure, and I don't quite understand whether I need to do anything to make this documentation change visible in the online documentation. I'd appreciate help improving the PR, if needed.

### How was this patch tested?
No tests added or executed.
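For reference, the semantics the docblock needs to convey: `translate` maps each character of the matching string positionally onto the replacement string, and characters with no counterpart are removed. A small self-contained sketch (the column name is illustrative):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, translate}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// 'A' -> '1', 'B' -> '2', 'C' has no counterpart so it is removed:
// "ABCDE" becomes "12DE".
Seq("ABCDE").toDF("s").select(translate(col("s"), "ABC", "12")).show()
```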
[GitHub] [spark] LuciferYang opened a new pull request, #40395: [SPARK-42770] WIP
LuciferYang opened a new pull request, #40395: URL: https://github.com/apache/spark/pull/40395

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] mridulm commented on pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
mridulm commented on PR #40286: URL: https://github.com/apache/spark/pull/40286#issuecomment-1465633950 Merged to master. Thanks for fixing this @ivoson !
[GitHub] [spark] mridulm closed pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry
mridulm closed pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry URL: https://github.com/apache/spark/pull/40286
[GitHub] [spark] wangshengjie123 commented on pull request #40391: [WIP][SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers
wangshengjie123 commented on PR #40391: URL: https://github.com/apache/spark/pull/40391#issuecomment-1465682073 I am not sure whether we should add an executor exit code and optimize the RegisterExecutor response message in this PR. In our production environment, we found that only filtering the excluded nodes when launching containers sometimes does not work as well as we want: the driver may not request new executors for a period of time, so the excluded-node list won't be sent to `YarnAllocator` through `requestTotalExecutorsWithPreferredLocalities`, and some executors will fail and count toward the failure count. So we added the executor exit code and optimized the RegisterExecutor response message.
[GitHub] [spark] zhengruifeng commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit
zhengruifeng commented on code in PR #40355: URL: https://github.com/apache/spark/pull/40355#discussion_r1133568797

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:
@@ -2065,6 +2065,44 @@ class PlanGenerationTestSuite
     fn.lit(Array(new CalendarInterval(2, 20, 100L), new CalendarInterval(2, 21, 200L))))
   }

+  test("function typedLit") {
+    simple.select(
+      fn.typedLit(fn.col("id")),
+      fn.typedLit('id),
+      fn.typedLit(1),
+      fn.typedLit[String](null),
+      fn.typedLit(true),
+      fn.typedLit(68.toByte),
+      fn.typedLit(9872.toShort),
+      fn.typedLit(-8726532),
+      fn.typedLit(7834609328726532L),
+      fn.typedLit(Math.E),
+      fn.typedLit(-0.8f),
+      fn.typedLit(BigDecimal(8997620, 5)),
+      fn.typedLit(BigDecimal(898897667231L, 7).bigDecimal),
+      fn.typedLit("connect!"),
+      fn.typedLit('T'),
+      fn.typedLit(Array.tabulate(10)(i => ('A' + i).toChar)),
+      fn.typedLit(Array.tabulate(23)(i => (i + 120).toByte)),
+      fn.typedLit(mutable.WrappedArray.make(Array[Byte](8.toByte, 6.toByte))),
+      fn.typedLit(null),
+      fn.typedLit(java.time.LocalDate.of(2020, 10, 10)),
+      fn.typedLit(Decimal.apply(BigDecimal(8997620, 6))),
+      fn.typedLit(java.time.Instant.ofEpochMilli(1677155519808L)),
+      fn.typedLit(new java.sql.Timestamp(12345L)),
+      fn.typedLit(java.time.LocalDateTime.of(2023, 2, 23, 20, 36)),
+      fn.typedLit(java.sql.Date.valueOf("2023-02-23")),
+      fn.typedLit(java.time.Duration.ofSeconds(200L)),
+      fn.typedLit(java.time.Period.ofDays(100)),
+      fn.typedLit(new CalendarInterval(2, 20, 100L)),
+
+      // Handle parameterized scala types e.g.: List, Seq and Map.
+      fn.typedLit(Some(1)),
+      fn.typedLit(Seq(1, 2, 3)),
+      fn.typedLit(Map("a" -> 1, "b" -> 2)),
+      fn.typedLit(("a", 2, 1.0)))

Review Comment: what about adding a few nested cases? e.g. nested `array` and `map` types
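A sketch of the kind of nested cases being suggested; the exact literals here are illustrative, not the PR's final test code:

```
import org.apache.spark.sql.{functions => fn}

val nested = Seq(
  fn.typedLit(Seq(Seq(1, 2), Seq(3))),               // array of array
  fn.typedLit(Map("a" -> Seq(1, 2), "b" -> Seq(3))), // map of string -> array
  fn.typedLit(Seq(Map("x" -> 1), Map("y" -> 2))))    // array of map
```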
[GitHub] [spark] pan3793 commented on pull request #40392: [SPARK-42769][K8S] Add `SPARK_DRIVER_POD_IP` env variable to executor pods
pan3793 commented on PR #40392: URL: https://github.com/apache/spark/pull/40392#issuecomment-1465812124

> ... for some executor pods to connect driver pods via IP.

Hi @dongjoon-hyun, I think it's quite useful, but in https://github.com/apache/spark/pull/39160#pullrequestreview-1229691638 you left a concern:

> ... I have a concern. Currently, Apache Spark uses K8s Service entity via [DriverServiceFeatureStep](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala) to access Spark driver pod in K8s environment.

Do you still have that concern now?
[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` test pass on Linux
LuciferYang commented on PR #40395: URL: https://github.com/apache/spark/pull/40395#issuecomment-1465860908

https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682

```
[info] - test implicit encoder resolution *** FAILED *** (1 second, 329 milliseconds)
[info]   2023-03-02T23:00:20.404434 did not equal 2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63)
[info]   at org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
[info]   at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Suite.run(Suite.scala:1114)
[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]   at java.base/java.lang.Thread.run(Thread.java:833)
```
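For context, the failure is a nanosecond versus microsecond precision mismatch: Catalyst's TimestampType stores microseconds, while `java.time` clocks on Java 17/Linux can return nanosecond precision, so a value that round-trips through Spark no longer equals the original. A minimal illustration of the truncation named in the PR title:

```
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Truncate the expected value to microseconds before comparing it with a
// value that has round-tripped through a microsecond-precision timestamp.
val expected: LocalDateTime = LocalDateTime.now().truncatedTo(ChronoUnit.MICROS)
```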
[GitHub] [spark] dongjoon-hyun commented on pull request #40392: [SPARK-42769][K8S] Add `SPARK_DRIVER_POD_IP` env variable to executor pods
dongjoon-hyun commented on PR #40392: URL: https://github.com/apache/spark/pull/40392#issuecomment-1465870752 @pan3793 . The goal of PR is different from your PR's goal.
- Your PR tried to add `SPARK_DRIVER_POD_NAME` to `Driver Pod` to expose it to **3rd party pods**.
- This PR aims to add `SPARK_DRIVER_POD_IP` to `Executor Pod` in order to help **internal communications between Spark executors and Spark driver**.

In addition, this is a kind of propagation of the information from the driver pod to the executor pods instead of exposing the executor pods' internal information.
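For background on the mechanism: propagating a value the submission client already knows into executor pods amounts to setting an env var on the pod spec. A sketch using fabric8's builder API; the helper is hypothetical and not the PR's actual code:

```
import io.fabric8.kubernetes.api.model.{EnvVar, EnvVarBuilder}

object ExecutorEnvVars {
  // Hypothetical helper: attach the driver pod IP, already known to the
  // scheduler backend, to each executor pod spec.
  def driverPodIpEnv(driverPodIp: String): EnvVar =
    new EnvVarBuilder()
      .withName("SPARK_DRIVER_POD_IP")
      .withValue(driverPodIp)
      .build()
}
```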
[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass
LuciferYang commented on PR #40395: URL: https://github.com/apache/spark/pull/40395#issuecomment-1465912943 cc @HyukjinKwon, also cc @bjornjorgensen, who reported this issue on the dev mailing list.
[GitHub] [spark] beliefer opened a new pull request, #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true
beliefer opened a new pull request, #40396: URL: https://github.com/apache/spark/pull/40396

### What changes were proposed in this pull request?
Currently, DS V2 pushdown lets each JDBC dialect decide whether to push down `OFFSET`, `LIMIT`, and table sample. Because some databases do not support all of these, the default value of these dialect pushdown APIs should be false, and a JDBC dialect that supports the syntax should override it. We also have several JDBC options about push down, such as `pushDownOffset`, that let users allow or disallow push down.

### Why are the changes needed?
This PR changes all the JDBC V2 pushdown options to default to true and all the dialects' pushdown APIs to default to false.

### Does this PR introduce _any_ user-facing change?
'Yes'. By default, the pushdown framework does not push down SQL syntax to a JDBC data source unless its dialect opts in. Users can enable or disable push down with the JDBC options about push down, such as `pushDownOffset`.

### How was this patch tested?
Test cases updated.
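For illustration, these are the user-facing knobs that the new defaults interact with. A sketch of toggling them on a JDBC read; the URL and table are placeholders:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db") // placeholder URL
  .option("dbtable", "some_table")                 // placeholder table
  .option("pushDownLimit", "true")  // allow LIMIT push down
  .option("pushDownOffset", "true") // allow OFFSET push down
  .load()
```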
[GitHub] [spark] panbingkun opened a new pull request, #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF
panbingkun opened a new pull request, #40397: URL: https://github.com/apache/spark/pull/40397

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add new UT. Pass GA.
[GitHub] [spark] panbingkun commented on a diff in pull request #39949: [SPARK-42386][SQL] Rewrite HiveGenericUDF with Invoke
panbingkun commented on code in PR #39949: URL: https://github.com/apache/spark/pull/39949#discussion_r1133823856

## sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
@@ -194,47 +183,52 @@ private[hive] case class HiveGenericUDF(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     copy(children = newChildren)

-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
-    val refTerm = ctx.addReferenceObj("this", this)
-    val childrenEvals = children.map(_.genCode(ctx))
-
-    val setDeferredObjects = childrenEvals.zipWithIndex.map {

Review Comment: Following it, I implemented codegen for `HiveSimpleUDF` in https://github.com/apache/spark/pull/40397.
[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch
EnricoMi commented on PR #39952: URL: https://github.com/apache/spark/pull/39952#issuecomment-1465739961 CC @cloud-fan @itholic @zhengruifeng