[GitHub] [spark] zhengruifeng commented on pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng commented on PR #40402:
URL: https://github.com/apache/spark/pull/40402#issuecomment-1467212310

   also cc @WeichenXu123 since this PR supports `df.collect` with UDT





[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


panbingkun commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134807339


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
 
+class HiveGenericUDFHelper(
+    funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =

Review Comment:
   If we do not define the `dataType`, `foldable`, and `deterministic` properties in this class, we will have to expose some internal logic, such as `returnInspector`.
   (screenshot: https://user-images.githubusercontent.com/15246973/224871989-7be999b4-c229-44d1-9e53-658ecacc539c.png)
   
   Comparing with the similar implementation in HiveSimpleUDF, I found some common properties, so I can do some abstraction and reuse them later.
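   For reference, a minimal, self-contained sketch of the delegation pattern being discussed; the class and member names below are illustrative stand-ins, not the actual hiveUDFs.scala code:
   
   ```
   // Simplified stand-ins for HiveGenericUDF / HiveGenericUDFHelper: the helper
   // computes the derived properties once, and the expression only delegates to
   // it, so internals like `returnInspector` never need to be exposed.
   class UdfEvaluatorHelper(udfIsDeterministic: Boolean, childrenDeterministic: Seq[Boolean]) {
     val deterministic: Boolean = udfIsDeterministic && childrenDeterministic.forall(identity)
     // The real helper additionally requires a ConstantObjectInspector for foldable.
     val foldable: Boolean = deterministic
   }
   
   class UdfExpression(helper: UdfEvaluatorHelper) {
     def deterministic: Boolean = helper.deterministic
     def foldable: Boolean = helper.foldable
   }
   ```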
   






[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134836246


##
mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+    getParallelism match {
+      case 1 =>
+        ThreadUtils.sameThread
+      case n =>
+        ExecutionContext.fromExecutorService(ThreadUtils
+          .newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+    }
+  }

Review Comment:
   it will make `mllib-common` directly depend on spark core. Let me take another look
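   As context for the snippet above, a hedged sketch of how such a parallelism-bound execution context is typically consumed. Standard-library stand-ins are used on purpose: `ThreadUtils.newDaemonCachedThreadPool` is Spark-internal, so a plain fixed pool takes its place here, and the "fit" call is just a placeholder string.
   
   ```
   import java.util.concurrent.Executors
   import scala.concurrent.{Await, ExecutionContext, Future}
   import scala.concurrent.duration.Duration
   
   object ParallelFitSketch {
     def main(args: Array[String]): Unit = {
       val parallelism = 4
       // Stand-in for getExecutionContext: a pool bounded by the parallelism param.
       val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))
       // Stand-in for launching several estimator.fit(...) calls concurrently.
       val futures = (1 to 10).map { i =>
         Future { s"model-$i" }(ec)
       }
       val models = futures.map(Await.result(_, Duration.Inf))
       println(models.mkString(", "))
       ec.shutdown()
     }
   }
   ```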






[GitHub] [spark] sadikovi commented on a diff in pull request #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true

2023-03-13 Thread via GitHub


sadikovi commented on code in PR #40396:
URL: https://github.com/apache/spark/pull/40396#discussion_r1134891054


##
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala:
##
@@ -49,10 +49,6 @@ class PostgresIntegrationSuite extends DockerJDBCIntegrationV2Suite with V2JDBCT
   override def sparkConf: SparkConf = super.sparkConf
     .set("spark.sql.catalog.postgresql", classOf[JDBCTableCatalog].getName)
     .set("spark.sql.catalog.postgresql.url", db.getJdbcUrl(dockerIp, externalPort))
-    .set("spark.sql.catalog.postgresql.pushDownTableSample", "true")

Review Comment:
   IMHO, I would still keep those configs in the tests; it makes it clear that the configs need to be enabled in the test, but it is fine either way for me.
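   For illustration, a minimal sketch of keeping the flag explicit in a test configuration. The catalog name and URL are placeholders mirroring the quoted PostgresIntegrationSuite, and only the option shown in the diff is set:
   
   ```
   import org.apache.spark.SparkConf
   
   val conf = new SparkConf()
     .set("spark.sql.catalog.postgresql",
       "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
     .set("spark.sql.catalog.postgresql.url", "jdbc:postgresql://localhost:5432/postgres")
     // Set explicitly even if it becomes the default, so the test documents its intent.
     .set("spark.sql.catalog.postgresql.pushDownTableSample", "true")
   ```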






[GitHub] [spark] itholic commented on pull request #40282: [SPARK-42672][PYTHON][DOCS] Document error class list

2023-03-13 Thread via GitHub


itholic commented on PR #40282:
URL: https://github.com/apache/spark/pull/40282#issuecomment-1467350642

   A reminder for @HyukjinKwon @srielau @MaxGekk about the error class documentation for PySpark.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #40395:
URL: https://github.com/apache/spark/pull/40395#discussion_r1134938946


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala:
##
@@ -130,9 +131,15 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with BeforeAndAfterAll {
     testImplicit(BigDecimal(decimal))
     testImplicit(Date.valueOf(LocalDate.now()))
     testImplicit(LocalDate.now())
-    testImplicit(LocalDateTime.now())
-    testImplicit(Instant.now())
-    testImplicit(Timestamp.from(Instant.now()))
+    // SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always
+    // get microseconds on both Linux and MacOS, but there are some differences when
+    // using Java 17, it will get accurate nanoseconds on Linux, but still get the microseconds
+    // on MacOS. At present, Spark always converts them to microseconds, this will cause the
+    // test fail when using Java 17 on Linux, so add `truncatedTo(ChronoUnit.MICROS)` to
+    // there to ensure the accuracy of input data is microseconds.
+    testImplicit(LocalDateTime.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Instant.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Timestamp.from(Instant.now().truncatedTo(ChronoUnit.MICROS)))

Review Comment:
   Shall we apply this change for Java 17 only?
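   For illustration, a hedged sketch of what a Java-17-only (and, per the follow-up later in this thread, Linux-only) guard could look like, using the commons-lang3 helpers that appear in the updated diff:
   
   ```
   import java.time.Instant
   import java.time.temporal.ChronoUnit
   import org.apache.commons.lang3.{JavaVersion, SystemUtils}
   
   // Only truncate where nanosecond-precision clocks actually show up.
   val instant =
     if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) && SystemUtils.IS_OS_LINUX) {
       Instant.now().truncatedTo(ChronoUnit.MICROS)
     } else {
       Instant.now()
     }
   ```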






[GitHub] [spark] LuciferYang commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


LuciferYang commented on code in PR #40395:
URL: https://github.com/apache/spark/pull/40395#discussion_r1135005783


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala:
##
@@ -130,9 +132,21 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with BeforeAndAfterAll {
     testImplicit(BigDecimal(decimal))
     testImplicit(Date.valueOf(LocalDate.now()))
     testImplicit(LocalDate.now())
-    testImplicit(LocalDateTime.now())
-    testImplicit(Instant.now())
-    testImplicit(Timestamp.from(Instant.now()))
+    // SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always
+    // get microseconds on both Linux and MacOS, but there are some differences when
+    // using Java 17, it will get accurate nanoseconds on Linux, but still get the microseconds
+    // on MacOS. At present, Spark always converts them to microseconds, this will cause the
+    // test fail when using Java 17 on Linux, so add `truncatedTo(ChronoUnit.MICROS)` when
+    // testing on Linux using Java 17 to ensure the accuracy of input data is microseconds.
+    if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) && SystemUtils.IS_OS_LINUX) {

Review Comment:
   Checked CentOS+Java 17, it works






[GitHub] [spark] cloud-fan commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

2023-03-13 Thread via GitHub


cloud-fan commented on PR #18990:
URL: https://github.com/apache/spark/pull/18990#issuecomment-1467186918

   It should have been fixed in 3.2+: https://github.com/apache/spark/pull/37855





[GitHub] [spark] wangshengjie123 commented on pull request #40391: [SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers

2023-03-13 Thread via GitHub


wangshengjie123 commented on PR #40391:
URL: https://github.com/apache/spark/pull/40391#issuecomment-1467201970

   @Ngone51 @tgravescs could you please help review this PR when you have time? Thanks.





[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


panbingkun commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134810496


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
 
+class HiveGenericUDFHelper(
+    funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =
+    isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]
+
+  @transient
+  private[hive] val dataType: DataType = inspectorToDataType(returnInspector)
+
+  @transient
+  private lazy val function = funcWrapper.createFunction[GenericUDF]()
+
+  @transient
+  private lazy val isUDFDeterministic = {
+    val udfType = function.getClass.getAnnotation(classOf[HiveUDFType])
+    udfType != null && udfType.deterministic() && !udfType.stateful()
+  }
+
+  @transient
+  private lazy val argumentInspectors = children.map(toInspector)
+
+  @transient
+  private lazy val deferredObjects: Array[DeferredObject] = argumentInspectors.zip(children).map {

Review Comment:
   Ok






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


dongjoon-hyun commented on code in PR #40395:
URL: https://github.com/apache/spark/pull/40395#discussion_r1134941041


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala:
##
@@ -130,9 +131,15 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with BeforeAndAfterAll {
     testImplicit(BigDecimal(decimal))
     testImplicit(Date.valueOf(LocalDate.now()))
     testImplicit(LocalDate.now())
-    testImplicit(LocalDateTime.now())
-    testImplicit(Instant.now())
-    testImplicit(Timestamp.from(Instant.now()))
+    // SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 & 11 always
+    // get microseconds on both Linux and MacOS, but there are some differences when
+    // using Java 17, it will get accurate nanoseconds on Linux, but still get the microseconds
+    // on MacOS. At present, Spark always converts them to microseconds, this will cause the
+    // test fail when using Java 17 on Linux, so add `truncatedTo(ChronoUnit.MICROS)` to
+    // there to ensure the accuracy of input data is microseconds.
+    testImplicit(LocalDateTime.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Instant.now().truncatedTo(ChronoUnit.MICROS))
+    testImplicit(Timestamp.from(Instant.now().truncatedTo(ChronoUnit.MICROS)))

Review Comment:
   Also, it would be great if we could apply the new code on the Linux environment only, @LuciferYang.






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134960384


##
mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala:
##
@@ -1,65 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.attribute
-
-import org.apache.spark.SparkFunSuite
-
-class AttributeGroupSuite extends SparkFunSuite {
-
-  test("attribute group") {
-val attrs = Array(
-  NumericAttribute.defaultAttr,
-  NominalAttribute.defaultAttr,
-  BinaryAttribute.defaultAttr.withIndex(0),
-  NumericAttribute.defaultAttr.withName("age").withSparsity(0.8),
-  NominalAttribute.defaultAttr.withName("size").withValues("small", "medium", "large"),
-  BinaryAttribute.defaultAttr.withName("clicked").withValues("no", "yes"),
-  NumericAttribute.defaultAttr,
-  NumericAttribute.defaultAttr)
-val group = new AttributeGroup("user", attrs)
-assert(group.size === 8)
-assert(group.name === "user")
-assert(group(0) === NumericAttribute.defaultAttr.withIndex(0))
-assert(group(2) === BinaryAttribute.defaultAttr.withIndex(2))
-assert(group.indexOf("age") === 3)
-assert(group.indexOf("size") === 4)
-assert(group.indexOf("clicked") === 5)
-assert(!group.hasAttr("abc"))
-intercept[NoSuchElementException] {
-  group("abc")
-}
-assert(group === AttributeGroup.fromMetadata(group.toMetadataImpl, group.name))
-assert(group === AttributeGroup.fromStructField(group.toStructField()))
-  }
-
-  test("attribute group without attributes") {
-val group0 = new AttributeGroup("user", 10)
-assert(group0.name === "user")
-assert(group0.numAttributes === Some(10))
-assert(group0.size === 10)
-assert(group0.attributes.isEmpty)
-assert(group0 === AttributeGroup.fromMetadata(group0.toMetadataImpl, group0.name))
-assert(group0 === AttributeGroup.fromStructField(group0.toStructField()))
-
-val group1 = new AttributeGroup("item")
-assert(group1.name === "item")
-assert(group1.numAttributes.isEmpty)
-assert(group1.attributes.isEmpty)
-assert(group1.size === -1)
-  }
-}

Review Comment:
   Similar question for `AttributeSuite.scala`, `JsonMatrixConverterSuite.scala`, ...






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134959146


##
mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala:
##
@@ -1,65 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.attribute
-
-import org.apache.spark.SparkFunSuite
-
-class AttributeGroupSuite extends SparkFunSuite {
-
-  test("attribute group") {
-val attrs = Array(
-  NumericAttribute.defaultAttr,
-  NominalAttribute.defaultAttr,
-  BinaryAttribute.defaultAttr.withIndex(0),
-  NumericAttribute.defaultAttr.withName("age").withSparsity(0.8),
-  NominalAttribute.defaultAttr.withName("size").withValues("small", "medium", "large"),
-  BinaryAttribute.defaultAttr.withName("clicked").withValues("no", "yes"),
-  NumericAttribute.defaultAttr,
-  NumericAttribute.defaultAttr)
-val group = new AttributeGroup("user", attrs)
-assert(group.size === 8)
-assert(group.name === "user")
-assert(group(0) === NumericAttribute.defaultAttr.withIndex(0))
-assert(group(2) === BinaryAttribute.defaultAttr.withIndex(2))
-assert(group.indexOf("age") === 3)
-assert(group.indexOf("size") === 4)
-assert(group.indexOf("clicked") === 5)
-assert(!group.hasAttr("abc"))
-intercept[NoSuchElementException] {
-  group("abc")
-}
-assert(group === AttributeGroup.fromMetadata(group.toMetadataImpl, group.name))
-assert(group === AttributeGroup.fromStructField(group.toStructField()))
-  }
-
-  test("attribute group without attributes") {
-val group0 = new AttributeGroup("user", 10)
-assert(group0.name === "user")
-assert(group0.numAttributes === Some(10))
-assert(group0.size === 10)
-assert(group0.attributes.isEmpty)
-assert(group0 === AttributeGroup.fromMetadata(group0.toMetadataImpl, group0.name))
-assert(group0 === AttributeGroup.fromStructField(group0.toStructField()))
-
-val group1 = new AttributeGroup("item")
-assert(group1.name === "item")
-assert(group1.numAttributes.isEmpty)
-assert(group1.attributes.isEmpty)
-assert(group1.size === -1)
-  }
-}

Review Comment:
   Q: This file should be moved to ml-common, but why does the git diff not show it as an "a --> b" move?






[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

2023-03-13 Thread via GitHub


ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467126454

   Btw, what happens if `"PYSPARK_NO_NAMESPACE_SHARE" in os.environ`?
   
   
https://github.com/apache/spark/blob/761e0c0f6f0d00733177a869b9ecdf454e13fc9f/python/pyspark/sql/utils.py#L148-L161
   
   Usually the env var is set, so we need to explicitly import `pyspark.sql.connect.function`.





[GitHub] [spark] liang3zy22 closed pull request #40347: [SPARK-42711][BUILD]Update usage info and shellcheck warn/error fix for build/sbt tool

2023-03-13 Thread via GitHub


liang3zy22 closed pull request #40347: [SPARK-42711][BUILD]Update usage info 
and shellcheck warn/error fix for build/sbt tool
URL: https://github.com/apache/spark/pull/40347





[GitHub] [spark] liang3zy22 commented on pull request #40347: [SPARK-42711][BUILD]Update usage info and shellcheck warn/error fix for build/sbt tool

2023-03-13 Thread via GitHub


liang3zy22 commented on PR #40347:
URL: https://github.com/apache/spark/pull/40347#issuecomment-1467133558

   Yes, this PR is kind of useless. I'll close it now.





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134787382


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   Similar question for Classifier, ProbabilisticClassifier






[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134829857



##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -191,18 +157,18 @@ private[hive] case class HiveGenericUDF(
 
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     copy(children = newChildren)
-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+  protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val refTerm = ctx.addReferenceObj("this", this)

Review Comment:
   do we still reference the entire `this`? Can we just reference the new 
evaluator?






[GitHub] [spark] zhengruifeng closed pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng closed pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor 
update to 3.4.0 version change message for Spark Connect
URL: https://github.com/apache/spark/pull/40401





[GitHub] [spark] zhengruifeng commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng commented on PR #40401:
URL: https://github.com/apache/spark/pull/40401#issuecomment-1467252638

   Thank you @allanf-db, merged into master/branch-3.4.





[GitHub] [spark] atronchi commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

2023-03-13 Thread via GitHub


atronchi commented on PR #18990:
URL: https://github.com/apache/spark/pull/18990#issuecomment-1467104804

   This issue appears to remain at large for the dataframe API which is used 
more broadly than RDD. What would it take to extend the fix to the dataframe 
API? 
   
   I verified this on Spark 3.2 using `df.repartition(1024)` on a dataframe 
with ~200k rows, which resulted in almost 30% EMPTY partitions, and the below 
shown skew of the remaining non-empty ones.
   (screenshot of the partition size distribution: https://user-images.githubusercontent.com/4906224/224852367-a33564c8-21f2-4daa-8da2-cbca0389594a.png)
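   A quick way to reproduce that measurement is to count rows per partition after the repartition. This is a hedged sketch: `df` stands for the ~200k-row DataFrame, and empty partitions simply do not appear in the grouped output, so 1024 minus the number of result rows gives the empty ones.
   
   ```
   import org.apache.spark.sql.functions.spark_partition_id
   
   val counts = df.repartition(1024)
     .groupBy(spark_partition_id().as("partition"))
     .count()
   
   // Non-empty partition sizes, smallest to largest.
   counts.orderBy("count").show(1024, truncate = false)
   ```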





[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134750091


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -272,6 +272,9 @@ message ExecutePlanResponse {
   // The metrics observed during the execution of the query plan.
   repeated ObservedMetrics observed_metrics = 6;
 
+  // The Spark schema
+  DataType schema = 7;

Review Comment:
   @grundprinzip The actual Spark data type is necessary to rebuild the UDT 
objects.






[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134750091


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -272,6 +272,9 @@ message ExecutePlanResponse {
   // The metrics observed during the execution of the query plan.
   repeated ObservedMetrics observed_metrics = 6;
 
+  // The Spark schema
+  DataType schema = 7;

Review Comment:
   @grundprinzip The actual Spark data type in the execution result is 
necessary to rebuild the UDT objects.






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134776526


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   Similar question for Model / Transformer / Predictor / Pipeline






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134777302


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.Vector
+
+/**
+ * Class that represents the features and label of a data point.
+ *
+ * @param label
+ *   Label for this data point.
+ * @param features
+ *   List of features for this data point.
+ */
+@Since("3.5.0")
+case class LabeledPoint(label: Double, features: Vector) {

Review Comment:
   can we also move this to mllib-common ?






[GitHub] [spark] WeichenXu123 commented on pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on PR #40097:
URL: https://github.com/apache/spark/pull/40097#issuecomment-1467213725

   For the 2 exceptions:
   > 2, org.apache.spark.ml.linalg.* except VectorUDTSuite due to cyclical dependency; (it copies the VectorUDTSuite except test("JavaTypeInference with VectorUDT"))
   > 3, org.apache.spark.ml.param.* except ParamsSuite due to cyclical dependency; (it copies the ParamsSuite except test("Filtering ParamMap"))
   
   
   You can move "JavaTypeInference with VectorUDT" out of VectorUDTSuite, and move "Filtering ParamMap" out of ParamsSuite, and then can move remaining test code without duplication.





[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134793723


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -272,6 +272,9 @@ message ExecutePlanResponse {
   // The metrics observed during the execution of the query plan.
   repeated ObservedMetrics observed_metrics = 6;
 
+  // The Spark schema
+  DataType schema = 7;

Review Comment:
   Yes, it's only for `df.collect` for now.






[GitHub] [spark] cloud-fan commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-13 Thread via GitHub


cloud-fan commented on PR #40385:
URL: https://github.com/apache/spark/pull/40385#issuecomment-1467245909

   Yea, AQE may remove materialized query stages due to optimizations like empty relation propagation, but I think it's fine: the shuffle files are still there (we don't unregister the shuffle), so the reused shuffle operator can still read these shuffle files using the shuffle id. The problem with EXPLAIN is that it only looks for the referenced exchange in the query plan tree; I think we can also look it up from the AQE stage cache map?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134827687


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
         if self._session is None:
             raise Exception("Cannot collect on empty session.")
         query = self._plan.to_proto(self._session.client)
-        table = self._session.client.to_table(query)
+        table, schema = self._session.client.to_table(query)
 
-        schema = from_arrow_schema(table.schema)
+        schema = schema or from_arrow_schema(table.schema)

Review Comment:
   It seems that MapType's `valueContainsNull` is stored in `to_arrow_type` but 
discarded in `from_arrow_type`.
   
   






[GitHub] [spark] zhengruifeng commented on pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


zhengruifeng commented on PR #40097:
URL: https://github.com/apache/spark/pull/40097#issuecomment-1467297939

   > For the 2 exceptions:
   > 
   > > 2, org.apache.spark.ml.linalg.* except VectorUDTSuite due to cyclical dependency; (it copies the VectorUDTSuite except test("JavaTypeInference with VectorUDT"))
   > > 3, org.apache.spark.ml.param.* except ParamsSuite due to cyclical dependency; (it copies the ParamsSuite except test("Filtering ParamMap"))
   > 
   > You can move "JavaTypeInference with VectorUDT" out of VectorUDTSuite, and move "Filtering ParamMap" out of ParamsSuite, and then can move remaining test code without duplication.
   
   good point, done





[GitHub] [spark] gatorsmile commented on pull request #40216: [SPARK-42593][PS] Deprecate & remove the APIs that will be removed in pandas 2.0.

2023-03-13 Thread via GitHub


gatorsmile commented on PR #40216:
URL: https://github.com/apache/spark/pull/40216#issuecomment-1467332483

   Let us mention all the breaking changes and deprecation in both release 
notes and migration guides 





[GitHub] [spark] panbingkun commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


panbingkun commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134790402


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
 
+class HiveGenericUDFHelper(

Review Comment:
   OK






[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134799615


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   The reason that I didn't add `Transformer` (as well as its subclasses) into 
the `mllib-common` is to control the dependency:
   
   ```
   def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
   ```
   
   `Transformer` depends on `Dataset`, which I think will not be involved in the catalyst common module
   
   cc @hvanhovell 






[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134799357


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
         if self._session is None:
             raise Exception("Cannot collect on empty session.")
         query = self._plan.to_proto(self._session.client)
-        table = self._session.client.to_table(query)
+        table, schema = self._session.client.to_table(query)
 
-        schema = from_arrow_schema(table.schema)
+        schema = schema or from_arrow_schema(table.schema)

Review Comment:
   Btw, for the collections, we can retrieve nullability from the schema.
   
   For example:
   
   ```
   if pa.types.is_list(at):
       field = at.value_field
       ArrayType(from_arrow_type(field.type), containsNull=field.nullable)
   ```






[GitHub] [spark] StevenChenDatabricks commented on pull request #40385: [SPARK-42753] ReusedExchange refers to non-existent nodes

2023-03-13 Thread via GitHub


StevenChenDatabricks commented on PR #40385:
URL: https://github.com/apache/spark/pull/40385#issuecomment-1467287615

   @cloud-fan Yes, this is purely a UI and EXPLAIN issue. It does not affect query execution.
   
   I'm not sure how the AQE context stageCache map would help. The issue in EXPLAIN is that the ReusedExchange.child references an Exchange node that is not referenced anywhere else in the plan tree, so we need to generate IDs for the subtree rooted at ReusedExchange.child and print them out. To do this, we need a way to check whether ReusedExchange.child is referenced anywhere else; if it is not, we need to recursively generate IDs for the subtree. I keep a HashSet of nodes whose IDs have already been generated and check ReusedExchange.child against it to see if we need to recursively generate IDs for the subtree.
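   A self-contained sketch of that idea, with a hypothetical tree type standing in for the actual SparkPlan/TreeNode API (illustrative only, not the PR code):
   
   ```
   import scala.collection.mutable
   
   // A node either has regular children or points at a reused subtree.
   class Node(val name: String, val children: Seq[Node] = Nil, val reused: Option[Node] = None)
   
   def assignIds(root: Node): Map[Node, Int] = {
     // Nodes that already received an ID (the HashSet mentioned above).
     val ids = mutable.LinkedHashMap.empty[Node, Int]
     def visit(n: Node): Unit = {
       if (!ids.contains(n)) {
         ids(n) = ids.size + 1
         n.children.foreach(visit)
         // If this node reuses a subtree that has no ID yet,
         // recursively generate IDs for it so EXPLAIN can still print it.
         n.reused.filterNot(ids.contains).foreach(visit)
       }
     }
     visit(root)
     ids.toMap
   }
   ```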





[GitHub] [spark] github-actions[bot] closed pull request #38887: [SPARK-41368][SQL] Reorder the window partition expressions by expression stats

2023-03-13 Thread via GitHub


github-actions[bot] closed pull request #38887: [SPARK-41368][SQL] Reorder the 
window partition expressions by expression stats
URL: https://github.com/apache/spark/pull/38887





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134775534


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   Q: Why not move this to mllib-common ?






[GitHub] [spark] xinrong-meng opened a new pull request, #40405: [WIP][SPARK-42340][CONNECT][PYTHON] Implement `GroupedData.applyInPandas`

2023-03-13 Thread via GitHub


xinrong-meng opened a new pull request, #40405:
URL: https://github.com/apache/spark/pull/40405

   - [ ] Parity tests
   
   ### What changes were proposed in this pull request?
   Implement `GroupedData.applyInPandas`.
   
   ### Why are the changes needed?
   Parity with vanilla PySpark.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. `GroupedData.applyInPandas` is supported now.
   
   
   ### How was this patch tested?
   Unit tests.
   





[GitHub] [spark] ulysses-you opened a new pull request, #40406: [SPARK-42101][SQL][FOLLOWUP] Improve TableCacheQueryStage with CoalesceShufflePartitions

2023-03-13 Thread via GitHub


ulysses-you opened a new pull request, #40406:
URL: https://github.com/apache/spark/pull/40406

   
   
   ### What changes were proposed in this pull request?
   
   `CoalesceShufflePartitions` should make sure all leaves are 
`ExchangeQueryStageExec`, to avoid collecting `TableCacheQueryStage`, since we 
cannot change the partition number of IMR.
   
   Add two tests to make sure `CoalesceShufflePartitions` works well with 
`TableCacheQueryStage`. Note that these two tests also pass without this PR, 
because `ValidateRequirements` reverts the wrong plan.
   
   ### Why are the changes needed?
   
   Avoid potential issue.
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   add test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on pull request #39160: [SPARK-41667][K8S] Expose env var SPARK_DRIVER_POD_NAME in Driver Pod

2023-03-13 Thread via GitHub


pan3793 commented on PR #39160:
URL: https://github.com/apache/spark/pull/39160#issuecomment-1467341858

   I found that 
[apple/batch-processing-gateway](https://github.com/apple/batch-processing-gateway)
 uses Pod Name to fetch the log as well
   
   
https://github.com/apple/batch-processing-gateway/blob/955d6a6f6dd0adf8e886cad0e5ff926e87697101/src/main/java/com/apple/spark/rest/ApplicationGetLogRest.java#L327


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


LuciferYang commented on code in PR #40395:
URL: https://github.com/apache/spark/pull/40395#discussion_r1134966350


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/SQLImplicitsTestSuite.scala:
##
@@ -130,9 +132,21 @@ class SQLImplicitsTestSuite extends ConnectFunSuite with 
BeforeAndAfterAll {
 testImplicit(BigDecimal(decimal))
 testImplicit(Date.valueOf(LocalDate.now()))
 testImplicit(LocalDate.now())
-testImplicit(LocalDateTime.now())
-testImplicit(Instant.now())
-testImplicit(Timestamp.from(Instant.now()))
+// SPARK-42770: Run `LocalDateTime.now()` and `Instant.now()` with Java 8 
& 11 always
+// get microseconds on both Linux and MacOS, but there are some 
differences when
+// using Java 17, it will get accurate nanoseconds on Linux, but still get 
the microseconds
+// on MacOS. At present, Spark always converts them to microseconds, this 
will cause the
+// test fail when using Java 17 on Linux, so add 
`truncatedTo(ChronoUnit.MICROS)` when
+// testing on Linux using Java 17 to ensure the accuracy of input data is 
microseconds.
+if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_17) && 
SystemUtils.IS_OS_LINUX) {

Review Comment:
   @dongjoon-hyun like this? Let me double check the new change on Linux & Java 
17



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect

2023-03-13 Thread via GitHub


itholic commented on PR #40401:
URL: https://github.com/apache/spark/pull/40401#issuecomment-1467125916

   Looks good otherwise


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect

2023-03-13 Thread via GitHub


itholic commented on PR #40401:
URL: https://github.com/apache/spark/pull/40401#issuecomment-1467125777

   Seems like there are some more unchanged docstrings in several files as 
below:
   ```
   spark % git grep "Support Spark Connect"
   python/pyspark/sql/column.py:Support Spark Connect.
   python/pyspark/sql/column.py:Support Spark Connect.
   python/pyspark/sql/dataframe.py:Support Spark Connect.
   python/pyspark/sql/dataframe.py:Support Spark Connect.
   python/pyspark/sql/udf.py:Support Spark Connect.
   python/pyspark/sql/udf.py:Support Spark Connect.
   ```
   Btw, you can change all the strings at once by running a shell command in 
the terminal, such as:
   ```shell
   find . -name "*.py" -exec perl -pi -e 's/Support Spark Connect/Supports 
Spark Connect/g' {} \;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum closed pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type

2023-03-13 Thread via GitHub


wangyum closed pull request #40190: [SPARK-42597][SQL] Support unwrap date type 
to timestamp type
URL: https://github.com/apache/spark/pull/40190


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40190: [SPARK-42597][SQL] Support unwrap date type to timestamp type

2023-03-13 Thread via GitHub


wangyum commented on PR #40190:
URL: https://github.com/apache/spark/pull/40190#issuecomment-1467139918

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134749726


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##


Review Comment:
   @grundprinzip The actual Spark data type is necessary to rebuild the UDT 
objects.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134796113


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
 if self._session is None:
 raise Exception("Cannot collect on empty session.")
 query = self._plan.to_proto(self._session.client)
-table = self._session.client.to_table(query)
+table, schema = self._session.client.to_table(query)
 
-schema = from_arrow_schema(table.schema)
+schema = schema or from_arrow_schema(table.schema)

Review Comment:
   That's interesting and I was thinking about a similar thing, but I didn't 
take that approach because it takes some space in each `RecordBatch`, which could 
repeatedly be sent to the client. That means the schema overhead could be huge if 
repeated many times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40142: [SPARK-41171][SQL] Infer and push down window limit through window if partitionSpec is empty

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40142:
URL: https://github.com/apache/spark/pull/40142#discussion_r1134795703


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##
@@ -130,6 +130,7 @@ abstract class Optimizer(catalogManager: CatalogManager)
 val operatorOptimizationBatch: Seq[Batch] = {
   Batch("Operator Optimization before Inferring Filters", fixedPoint,
 operatorOptimizationRuleSet: _*) ::
+  Batch("Infer window group limit", Once, InferWindowGroupLimit) ::

Review Comment:
   I think this is an unrelated change and is actually a regression. We may 
infer filters and it's better to run window limit optimization as late as 
possible. @beliefer can we revert this change about moving the rule?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


ueshin commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134799357


##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
 if self._session is None:
 raise Exception("Cannot collect on empty session.")
 query = self._plan.to_proto(self._session.client)
-table = self._session.client.to_table(query)
+table, schema = self._session.client.to_table(query)
 
-schema = from_arrow_schema(table.schema)
+schema = schema or from_arrow_schema(table.schema)

Review Comment:
   Btw, for the collections, we can retrieve nullability from the schema.
   
   For example:
   
   ```py
   if pa.types.is_list(at):
       field = at.value_field
       ArrayType(from_arrow_type(field.type), containsNull=field.nullable)
   ```
   
   I guess map type also works?
   
   ```py
   if pa.types.is_map(at):
       valueContainsNull = at.item_field.nullable
   ...
   ```
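   
   A self-contained sketch of that idea (with a hypothetical `from_arrow_type_sketch` 
helper, not PySpark's actual converter; only a couple of primitive types are 
handled for brevity):
   
   ```py
   import pyarrow as pa
   import pyarrow.types as pat
   from pyspark.sql.types import ArrayType, DataType, LongType, MapType, StringType
   
   def _primitive(at: pa.DataType) -> DataType:
       # Stand-in for the real Arrow -> Spark primitive conversion.
       if pat.is_int64(at):
           return LongType()
       if pat.is_string(at):
           return StringType()
       raise TypeError(f"unsupported in this sketch: {at}")
   
   def from_arrow_type_sketch(at: pa.DataType) -> DataType:
       if pat.is_list(at):
           field = at.value_field
           return ArrayType(from_arrow_type_sketch(field.type), containsNull=field.nullable)
       if pat.is_map(at):
           return MapType(
               from_arrow_type_sketch(at.key_type),
               from_arrow_type_sketch(at.item_type),
               valueContainsNull=at.item_field.nullable,
           )
       return _primitive(at)
   
   # Element declared non-nullable on the Arrow side -> containsNull=False.
   list_t = pa.list_(pa.field("element", pa.int64(), nullable=False))
   print(from_arrow_type_sketch(list_t))
   ```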



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134881273


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/feature/LabeledPoint.scala:
##
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.linalg.Vector
+
+/**
+ * Class that represents the features and label of a data point.
+ *
+ * @param label
+ *   Label for this data point.
+ * @param features
+ *   List of features for this data point.
+ */
+@Since("3.5.0")
+case class LabeledPoint(label: Double, features: Vector) {

Review Comment:
   done, but we can not change the since version then



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #40404: [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread via GitHub


gengliangwang commented on PR #40404:
URL: https://github.com/apache/spark/pull/40404#issuecomment-1467317784

   merging to master/3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #40404: [SPARK-42777][SQL] Support converting TimestampNTZ catalog stats to plan stats

2023-03-13 Thread via GitHub


gengliangwang closed pull request #40404: [SPARK-42777][SQL] Support converting 
TimestampNTZ catalog stats to plan stats
URL: https://github.com/apache/spark/pull/40404


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gatorsmile commented on pull request #40336: [SPARK-42706][SQL][DOCS] Document the Spark SQL error classes in user-facing documentation.

2023-03-13 Thread via GitHub


gatorsmile commented on PR #40336:
URL: https://github.com/apache/spark/pull/40336#issuecomment-1467329403

   @MaxGekk  should we merge it to 3.4?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134945494


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   Can we change it to
   `def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame` ?
   
   Shall we still support Dataset[_] input ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134961956


##
mllib/core/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala:
##
@@ -1,65 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.attribute
-
-import org.apache.spark.SparkFunSuite
-
-class AttributeGroupSuite extends SparkFunSuite {
-
-  test("attribute group") {
-val attrs = Array(
-  NumericAttribute.defaultAttr,
-  NominalAttribute.defaultAttr,
-  BinaryAttribute.defaultAttr.withIndex(0),
-  NumericAttribute.defaultAttr.withName("age").withSparsity(0.8),
-  NominalAttribute.defaultAttr.withName("size").withValues("small", 
"medium", "large"),
-  BinaryAttribute.defaultAttr.withName("clicked").withValues("no", "yes"),
-  NumericAttribute.defaultAttr,
-  NumericAttribute.defaultAttr)
-val group = new AttributeGroup("user", attrs)
-assert(group.size === 8)
-assert(group.name === "user")
-assert(group(0) === NumericAttribute.defaultAttr.withIndex(0))
-assert(group(2) === BinaryAttribute.defaultAttr.withIndex(2))
-assert(group.indexOf("age") === 3)
-assert(group.indexOf("size") === 4)
-assert(group.indexOf("clicked") === 5)
-assert(!group.hasAttr("abc"))
-intercept[NoSuchElementException] {
-  group("abc")
-}
-assert(group === AttributeGroup.fromMetadata(group.toMetadataImpl, 
group.name))
-assert(group === AttributeGroup.fromStructField(group.toStructField()))
-  }
-
-  test("attribute group without attributes") {
-val group0 = new AttributeGroup("user", 10)
-assert(group0.name === "user")
-assert(group0.numAttributes === Some(10))
-assert(group0.size === 10)
-assert(group0.attributes.isEmpty)
-assert(group0 === AttributeGroup.fromMetadata(group0.toMetadataImpl, 
group0.name))
-assert(group0 === AttributeGroup.fromStructField(group0.toStructField()))
-
-val group1 = new AttributeGroup("item")
-assert(group1.name === "item")
-assert(group1.numAttributes.isEmpty)
-assert(group1.attributes.isEmpty)
-assert(group1.size === -1)
-  }
-}

Review Comment:
   Got it, I clicked into the diff for a single commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #18990: [SPARK-21782][Core] Repartition creates skews when numPartitions is a power of 2

2023-03-13 Thread via GitHub


srowen commented on PR #18990:
URL: https://github.com/apache/spark/pull/18990#issuecomment-1467131067

   @atronchi what is "df" here? I couldn't reproduce that with a DF of 200K 
simple rows


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] linhongliu-db commented on pull request #40403: [SPARK-42754][SQL][UI] Fix backward compatibility issue in nested SQL execution

2023-03-13 Thread via GitHub


linhongliu-db commented on PR #40403:
URL: https://github.com/apache/spark/pull/40403#issuecomment-1467162461

   cc @JoshRosen @rednaxelafx @xinrong-meng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40401: [SPARK-42773][DOCS][PYTHON] Minor update to 3.4.0 version change message for Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng commented on PR #40401:
URL: https://github.com/apache/spark/pull/40401#issuecomment-1467188490

   LGTM pending CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134786529


##
mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum 
number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a 
same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+getParallelism match {
+  case 1 =>
+ThreadUtils.sameThread
+  case n =>
+ExecutionContext.fromExecutorService(ThreadUtils
+  
.newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+}
+  }

Review Comment:
   I prefer to convert this to a helper function and move it to mllib-common, 
so we don't need to add a new trait HasExecutionContext.
   
   `HasParallelism` shows it is a parallelism param, but `HasExecutionContext` 
is weird: it inherits HasParallelism, which makes the code confusing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40402: [SPARK-42020][CONNECT][PYTHON] Support UserDefinedType in Spark Connect

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40402:
URL: https://github.com/apache/spark/pull/40402#discussion_r1134776117


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -272,6 +272,9 @@ message ExecutePlanResponse {
   // The metrics observed during the execution of the query plan.
   repeated ObservedMetrics observed_metrics = 6;
 
+  // The Spark schema
+  DataType schema = 7;

Review Comment:
   Is it an optional field only available in `df.collect`?



##
python/pyspark/sql/connect/dataframe.py:
##
@@ -1344,9 +1344,9 @@ def collect(self) -> List[Row]:
 if self._session is None:
 raise Exception("Cannot collect on empty session.")
 query = self._plan.to_proto(self._session.client)
-table = self._session.client.to_table(query)
+table, schema = self._session.client.to_table(query)
 
-schema = from_arrow_schema(table.schema)
+schema = schema or from_arrow_schema(table.schema)

Review Comment:
   When I was implementing the collection I was thinking about this dumb 
question: is it possible to make the arrow schema 100% compatible with the spark 
schema if we store the differing fields in arrow metadata?
   e.g. for a spark `MapType`, store the `valueContainsNull` as metadata in 
the `pa.schema`.
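   
   A minimal illustration of that alternative (purely a sketch; the PR under 
review instead returns the Spark schema in the response):
   
   ```py
   import pyarrow as pa
   
   # Attach the Spark-only flag as Arrow field metadata on the producer side.
   map_t = pa.map_(pa.string(), pa.int64())
   field = pa.field("m", map_t, metadata={b"valueContainsNull": b"false"})
   schema = pa.schema([field])
   
   # Recover the flag from the metadata on the client side.
   meta = schema.field("m").metadata or {}
   value_contains_null = meta.get(b"valueContainsNull", b"true") == b"true"
   print(value_contains_null)  # False
   ```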



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134828589


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
 
+class HiveGenericUDFHelper(
+funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && 
children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =

Review Comment:
   SGTM, we can probably add a base class `HiveUDFEvaluatorBase`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134828589


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -235,6 +200,56 @@ private[hive] case class HiveGenericUDF(
   }
 }
 
+class HiveGenericUDFHelper(
+funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
+  extends HiveInspectors
+  with Serializable
+  with Logging {
+
+  @transient
+  private[hive] val deterministic = isUDFDeterministic && 
children.forall(_.deterministic)
+
+  @transient
+  private[hive] val foldable =

Review Comment:
   SGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


LuciferYang commented on PR #40395:
URL: https://github.com/apache/spark/pull/40395#issuecomment-1467321228

   friendly ping @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Stove-hust commented on pull request #40393: []SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467339828

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   --- stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   --- resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   --- The first stage10 task completes one after another, and 
notifyDriverAboutPushCompletion to end stage 10, and mark finalizeTask, because 
the stage is not in runningStages, so the stage cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   --- Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   --- notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   --- stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   --- resubmit stage 10
   2/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 
Reusing state from 

[GitHub] [spark] Stove-hust commented on pull request #40393: []SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467340408

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   Sure (added some comments)
   --- stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   -- resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   -- The first stage10 task completes one after another, and 
notifyDriverAboutPushCompletion to end stage 10, and mark finalizeTask, because 
the stage is not in runningStages, so the stage cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   --- Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   --- notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   --- stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   --- resubmit stage 10
   2/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 

[GitHub] [spark] Stove-hust commented on pull request #40393: []SPARK-40082]

2023-03-13 Thread via GitHub


Stove-hust commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467339346

   > @Stove-hust Thank you for reporting and the patch. Would you be able to 
share driver logs?
   
   sure.
   # stage 10 failed 
   22/10/15 10:55:58 WARN task-result-getter-1 TaskSetManager: Lost task 435.1 
in stage 10.0 (TID 6822, zw02-data-hdp-dn21102.mt, executor 101): 
FetchFailed(null, shuffleId=3, mapIndex=-1, mapId=-1, reduceId=435, message=
   org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) failed in 601.792 s due 
to org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3 partition 435
   
   # resubmit stage 10 && parentStage 9
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) and ShuffleMapStage 10 
(processCmd at CliDriver.java:386) due to fetch failure
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Resubmitting 
failed stages
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Push-based 
shuffle disabled for ShuffleMapStage 9 (processCmd at CliDriver.java:386) since 
it is already shuffle merge finalized
   22/10/15 10:55:58 INFO dag-scheduler-event-loop DAGScheduler: Submitting 3 
missing tasks from ShuffleMapStage 9 (MapPartitionsRDD[22] at processCmd at 
CliDriver.java:386) (first 15 tasks are for partitions Vector(98, 372, 690))
   22/10/15 10:55:58 INFO dag-scheduler-event-loop YarnClusterScheduler: Adding 
task set 9.1 with 3 tasks
   
   # The first stage10 task completes one after another, and 
notifyDriverAboutPushCompletion to end stage 10, and mark finalizeTask, because 
the stage is not in runningStages, so the stage cannot be marked 
shuffleMergeFinalized.
   22/10/15 10:55:58 INFO task-result-getter-0 TaskSetManager: Finished task 
325.0 in stage 10.0 (TID 6166) in 154455 ms on zw02-data-hdp-dn25537.mt 
(executor 117) (494/500)
   22/10/15 10:55:59 WARN task-result-getter-1 TaskSetManager: Lost task 325.1 
in stage 10.0 (TID 6671, zw02-data-hdp-dn23160.mt, executor 47): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 WARN task-result-getter-1 TaskSetManager: Lost task 358.1 
in stage 10.0 (TID 6731, zw02-data-hdp-dn25537.mt, executor 95): TaskKilled 
(another attempt succeeded)
   22/10/15 10:56:20 INFO task-result-getter-1 TaskSetManager: Task 358.1 in 
stage 10.0 (TID 6731) failed, but the task will not be re-executed (either 
because the task failed with a shuffle data fetch failure, so the previous 
stage needs to be re-run, or because a different copy of the task has already 
succeeded).
   
   # Removed TaskSet 10.0, whose tasks have all completed
   22/10/15 10:56:22 INFO task-result-getter-1 TaskSetManager: Ignoring 
task-finished event for 435.0 in stage 10.0 because task 435 has already 
completed successfully
   22/10/15 10:56:22 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 10.0, whose tasks have all completed, from pool 
   
   # notifyDriverAboutPushCompletion stage 10
   22/10/15 10:56:23 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) scheduled for finalizing 
shuffle merge in 0 s
   22/10/15 10:56:23 INFO shuffle-merge-finalizer-2 DAGScheduler: 
ShuffleMapStage 10 (processCmd at CliDriver.java:386) finalizing the shuffle 
merge with registering merge results set to true
   
   # stage 9 finished 
   22/10/15 10:57:51 INFO task-result-getter-1 TaskSetManager: Finished task 
2.0 in stage 9.1 (TID 6825) in 112825 ms on zw02-data-hdp-dn25559.mt (executor 
74) (3/3)
   22/10/15 10:57:51 INFO task-result-getter-1 YarnClusterScheduler: Removed 
TaskSet 9.1, whose tasks have all completed, from pool 
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: 
ShuffleMapStage 9 (processCmd at CliDriver.java:386) finished in 112.832 s
   
   # resubmit stage 10
   2/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: looking for 
newly runnable stages
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: running: 
Set(ShuffleMapStage 11, ShuffleMapStage 8)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: waiting: 
Set(ShuffleMapStage 12, ShuffleMapStage 10)
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: failed: Set()
   22/10/15 10:57:51 INFO dag-scheduler-event-loop DAGScheduler: Submitting 
ShuffleMapStage 10 (MapPartitionsRDD[36] at processCmd at CliDriver.java:386), 
which has no missing parents
   22/10/15 10:57:51 INFO dag-scheduler-event-loop OutputCommitCoordinator: 
Reusing state from previous 

[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134994989


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -129,58 +129,25 @@ private[hive] class DeferredObjectAdapter(oi: 
ObjectInspector, dataType: DataTyp
 private[hive] case class HiveGenericUDF(
 name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
   extends Expression
-  with HiveInspectors
-  with Logging
   with UserDefinedExpression {
 
   override def nullable: Boolean = true
 
-  override lazy val deterministic: Boolean = isUDFDeterministic && 
children.forall(_.deterministic)
-
-  override def foldable: Boolean =
-isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]
-
-  @transient
-  lazy val function = funcWrapper.createFunction[GenericUDF]()
+  override lazy val deterministic: Boolean = evaluator.deterministic
 
-  @transient
-  private lazy val argumentInspectors = children.map(toInspector)
+  override def foldable: Boolean = evaluator.foldable
 
-  @transient
-  private lazy val returnInspector = {
-function.initializeAndFoldConstants(argumentInspectors.toArray)
-  }
+  override lazy val dataType: DataType = evaluator.dataType

Review Comment:
   ```suggestion
 override def dataType: DataType = evaluator.dataType
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a diff in pull request #40394: [SPARK-42771][SQL] Refactor HiveGenericUDF

2023-03-13 Thread via GitHub


cloud-fan commented on code in PR #40394:
URL: https://github.com/apache/spark/pull/40394#discussion_r1134995335


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -129,58 +129,25 @@ private[hive] class DeferredObjectAdapter(oi: 
ObjectInspector, dataType: DataTyp
 private[hive] case class HiveGenericUDF(
 name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
   extends Expression
-  with HiveInspectors
-  with Logging
   with UserDefinedExpression {
 
   override def nullable: Boolean = true
 
-  override lazy val deterministic: Boolean = isUDFDeterministic && 
children.forall(_.deterministic)
-
-  override def foldable: Boolean =
-isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]
-
-  @transient
-  lazy val function = funcWrapper.createFunction[GenericUDF]()
+  override lazy val deterministic: Boolean = evaluator.deterministic
 
-  @transient
-  private lazy val argumentInspectors = children.map(toInspector)
+  override def foldable: Boolean = evaluator.foldable
 
-  @transient
-  private lazy val returnInspector = {
-function.initializeAndFoldConstants(argumentInspectors.toArray)
-  }
+  override lazy val dataType: DataType = evaluator.dataType
 
-  // Visible for codegen
   @transient
-  lazy val unwrapper: Any => Any = unwrapperFor(returnInspector)
-
-  @transient
-  private lazy val isUDFDeterministic = {
-val udfType = function.getClass.getAnnotation(classOf[HiveUDFType])
-udfType != null && udfType.deterministic() && !udfType.stateful()
-  }
-
-  // Visible for codegen
-  @transient
-  lazy val deferredObjects: Array[DeferredObject] = 
argumentInspectors.zip(children).map {
-case (inspect, child) => new DeferredObjectAdapter(inspect, child.dataType)
-  }.toArray[DeferredObject]
-
-  override lazy val dataType: DataType = inspectorToDataType(returnInspector)
+  private[hive] lazy val evaluator = new HiveGenericUDFEvaluator(funcWrapper, 
children)

Review Comment:
   ```suggestion
 private lazy val evaluator = new HiveGenericUDFEvaluator(funcWrapper, 
children)
   ```
   unless we need to access it somewhere.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134995656


##
mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum 
number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a 
same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+getParallelism match {
+  case 1 =>
+ThreadUtils.sameThread
+  case n =>
+ExecutionContext.fromExecutorService(ThreadUtils
+  
.newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+}
+  }

Review Comment:
   > Can we also move ThreadUtils to common ?
   
   I think it is not suitable to be in `mllib-common` ; it should be in another 
common module extracted from spark-core.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

2023-03-13 Thread via GitHub


ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467120067

   I guess we can just go with the comment:
   
   ```py
   # The implementation of pandas_udf is embedded in 
pyspark.sql.function.pandas_udf
   # for code reuse.
   from pyspark.sql.functions import pandas_udf as pandas_udf
   ```
   
   for those who want to import `pyspark.sql.connect.functions`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134787824


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/ml/Estimator.scala:
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml
+
+import scala.annotation.varargs
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.sql.Dataset
+
+/**
+ * Abstract class for estimators that fit models to data.
+ */
+abstract class Estimator[M <: Model[M]] extends PipelineStage {

Review Comment:
   For meta algorithms, I think we can move all of them to mllib-common, but we can do it in a follow-up PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gerashegalov commented on a diff in pull request #40372: [SPARK-42752][PYSPARK][SQL] Make PySpark exceptions printable during initialization

2023-03-13 Thread via GitHub


gerashegalov commented on code in PR #40372:
URL: https://github.com/apache/spark/pull/40372#discussion_r1134801751


##
python/pyspark/errors/exceptions/captured.py:
##
@@ -65,8 +65,15 @@ def __str__(self) -> str:
 assert SparkContext._jvm is not None
 
 jvm = SparkContext._jvm
-sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
-debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+
+# SPARK-42752: default to True to see issues with initialization
+debug_enabled = True
+try:
+sql_conf = jvm.org.apache.spark.sql.internal.SQLConf.get()
+debug_enabled = sql_conf.pysparkJVMStacktraceEnabled()
+except BaseException:

Review Comment:
   I advocate for keeping the likelihood of an unhelpful unprintable exception 
during initialization to the minimum. I would not want to revisit the issue for 
other runtime exceptions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

2023-03-13 Thread via GitHub


xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467281968

   We didn't wrap `pyspark.sql.functions.pandas_udf` with `try_remote_functions`, so `"PYSPARK_NO_NAMESPACE_SHARE"` should be irrelevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40360: [SPARK-42741][SQL] Do not unwrap casts in binary comparison when literal is null

2023-03-13 Thread via GitHub


wangyum commented on PR #40360:
URL: https://github.com/apache/spark/pull/40360#issuecomment-1467294557

   cc @cloud-fan  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] otterc commented on pull request #40393: []SPARK-40082]

2023-03-13 Thread via GitHub


otterc commented on PR #40393:
URL: https://github.com/apache/spark/pull/40393#issuecomment-1467314593

   @Stove-hust Thank you for the report and the patch. Would you be able to share the driver logs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

2023-03-13 Thread via GitHub


ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467349620

   If it's irrelevant, that means it's an issue, no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40097: [SPARK-42508][CONNECT][ML] Extract the common .ml classes to `mllib-common`

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40097:
URL: https://github.com/apache/spark/pull/40097#discussion_r1134946826


##
mllib/core/src/main/scala/org/apache/spark/ml/param/shared/HasExecutionContext.scala:
##
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.param.shared
+
+import scala.concurrent.ExecutionContext
+
+import org.apache.spark.util.ThreadUtils
+
+private[ml] trait HasExecutionContext extends HasParallelism {
+
+  /**
+   * Create a new execution context with a thread-pool that has a maximum 
number of threads
+   * set to the value of [[parallelism]]. If this param is set to 1, a 
same-thread executor
+   * will be used to run in serial.
+   */
+  private[ml] def getExecutionContext: ExecutionContext = {
+getParallelism match {
+  case 1 =>
+ThreadUtils.sameThread
+  case n =>
+ExecutionContext.fromExecutorService(ThreadUtils
+  
.newDaemonCachedThreadPool(s"${this.getClass.getSimpleName}-thread-pool", n))
+}
+  }

Review Comment:
   Can we also move `ThreadUtils` to common ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #39949: [SPARK-42386][SQL] Rewrite HiveGenericUDF with Invoke

2023-03-13 Thread via GitHub


panbingkun commented on code in PR #39949:
URL: https://github.com/apache/spark/pull/39949#discussion_r1133604830


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -194,47 +183,52 @@ private[hive] case class HiveGenericUDF(
 
   override protected def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): Expression =
 copy(children = newChildren)
-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
-val refTerm = ctx.addReferenceObj("this", this)
-val childrenEvals = children.map(_.genCode(ctx))
-
-val setDeferredObjects = childrenEvals.zipWithIndex.map {

Review Comment:
   I have submitted a new PR, https://github.com/apache/spark/pull/40394, to refactor `HiveGenericUDF`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1133666221


##
mllib/src/main/scala/org/apache/spark/ml/param/params.scala:
##
@@ -44,8 +45,14 @@ import org.apache.spark.ml.util.Identifiable
  *See [[ParamValidators]] for factory methods for common 
validation functions.
  * @tparam T param value type
  */
-class Param[T](val parent: String, val name: String, val doc: String, val 
isValid: T => Boolean)
-  extends Serializable {
+class Param[T: ClassTag](

Review Comment:
   @zhengruifeng 
   
   I recall I tried this approach `(implicit paramValueClassTag: ClassTag[T])` before, but it makes it hard for us to get the classTag object, so I prefer the current approach.
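
   For context, a tiny illustration (simplified, not the actual `Param` class) of what the two styles look like when obtaining the runtime class of `T`:

   ```scala
   import scala.reflect.ClassTag

   // Context bound: the ClassTag is summoned inside the body when needed.
   class BoundParam[T: ClassTag](val name: String) {
     val valueClass: Class[_] = implicitly[ClassTag[T]].runtimeClass
   }

   // Explicit implicit parameter list: the ClassTag arrives as a constructor argument.
   class ImplicitParam[T](val name: String)(implicit ct: ClassTag[T]) {
     val valueClass: Class[_] = ct.runtimeClass
   }

   val a = new BoundParam[Double]("maxIter")
   val b = new ImplicitParam[Double]("maxIter")
   assert(a.valueClass == b.valueClass)
   ```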



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shrprasa commented on a diff in pull request #40128: [SPARK-42466][K8S]: Cleanup k8s upload directory when job terminates

2023-03-13 Thread via GitHub


shrprasa commented on code in PR #40128:
URL: https://github.com/apache/spark/pull/40128#discussion_r1133671423


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala:
##
@@ -143,6 +144,9 @@ private[spark] class Client(
 logError("Please check \"kubectl auth can-i create [resource]\" 
first." +
   " It should be yes. And please also check your feature step 
implementation.")
 kubernetesClient.resourceList(preKubernetesResources: _*).delete()
+// register shutdownhook for cleaning up the upload dir only

Review Comment:
   Client mode will not help: in that case, cleanup would be missed if job submission fails in cluster deploy mode.
   We already do some other resource cleanup in the catch blocks of this class, so registering the upload path cleanup here should also be fine.
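
   A rough sketch of what registering such a hook from Spark's own submission code could look like (this assumes the existing `org.apache.spark.util.ShutdownHookManager` utility; `cleanupUploadDir` is a hypothetical helper, not the PR's actual code):

   ```scala
   import org.apache.spark.util.ShutdownHookManager

   // Hypothetical helper: deletes the directory staged under spark.kubernetes.file.upload.path.
   def cleanupUploadDir(): Unit = {
     // delete the upload directory contents here
   }

   // Register the cleanup so it also runs when submission fails and the JVM exits.
   ShutdownHookManager.addShutdownHook(ShutdownHookManager.DEFAULT_SHUTDOWN_PRIORITY) { () =>
     cleanupUploadDir()
   }
   ```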



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer closed pull request #39990: [SPARK-42415][SQL] The built-in dialects support OFFSET and paging query.

2023-03-13 Thread via GitHub


beliefer closed pull request #39990: [SPARK-42415][SQL] The built-in dialects 
support OFFSET and paging query.
URL: https://github.com/apache/spark/pull/39990


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #39990: [SPARK-42415][SQL] The built-in dialects support OFFSET and paging query.

2023-03-13 Thread via GitHub


beliefer commented on PR #39990:
URL: https://github.com/apache/spark/pull/39990#issuecomment-1465922089

   https://github.com/apache/spark/pull/40396 is opened to replace this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on pull request #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true

2023-03-13 Thread via GitHub


beliefer commented on PR #40396:
URL: https://github.com/apache/spark/pull/40396#issuecomment-1465922700

   ping @huaxingao cc @cloud-fan @sadikovi


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1133771185


##
connector/connect/common/src/main/protobuf/spark/connect/ml.proto:
##
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+syntax = 'proto3';
+
+package spark.connect;
+
+import "spark/connect/expressions.proto";
+import "spark/connect/relations.proto";
+import "spark/connect/ml_common.proto";
+
+option java_multiple_files = true;
+option java_package = "org.apache.spark.connect.proto";
+
+
+message MlEvaluator {
+  string name = 1;
+  MlParams params = 2;
+  string uid = 3;
+}
+
+
+message MlCommand {
+  oneof ml_command_type {
+Fit fit = 1;
+FetchModelAttr fetch_model_attr = 2;
+FetchModelSummaryAttr fetch_model_summary_attr = 3;
+LoadModel load_model = 4;
+SaveModel save_model = 5;
+Evaluate evaluate = 6;
+SaveStage save_stage = 7;
+LoadStage load_stage = 8;
+SaveEvaluator save_evaluator = 9;
+LoadEvaluator load_evaluator = 10;
+CopyModel copy_model = 11;
+  }
+
+  message Fit {
+MlStage estimator = 1;
+Relation dataset = 2;
+  }
+
+  message Evaluate {
+MlEvaluator evaluator = 1;
+  }
+
+  message LoadModel {
+string name = 1;
+string path = 2;
+  }
+
+  message SaveModel {
+int64 model_ref_id = 1;
+string path = 2; // saving path
+bool overwrite = 3;
+map<string, string> options = 4; // saving options
+  }
+
+  message LoadStage {
+string name = 1;
+string path = 2;
+MlStage.StageType type = 3;
+  }
+
+  message SaveStage {
+MlStage stage = 1;
+string path = 2; // saving path
+bool overwrite = 3;
+map<string, string> options = 4; // saving options
+  }
+
+  message LoadEvaluator {
+string name = 1;
+string path = 2;
+  }
+
+  message SaveEvaluator {
+MlEvaluator evaluator = 1;
+string path = 2; // saving path
+bool overwrite = 3;
+map<string, string> options = 4; // saving options
+  }
+
+  message FetchModelAttr {
+int64 model_ref_id = 1;
+string name = 2;
+  }
+
+  message FetchModelSummaryAttr {
+int64 model_ref_id = 1;
+string name = 2;
+MlParams params = 3;
+
+// Evaluation dataset that it uses to computes
+// the summary attribute
+// If not set, get attributes from
+// model.summary (i.e. the summary on training dataset)
+optional Relation evaluation_dataset = 4;
+  }
+
+  message CopyModel {
+int64 model_ref_id = 1;
+  }
+}
+
+
+message MlCommandResponse {
+  oneof ml_command_response_type {
+Expression.Literal literal = 1;
+ModelInfo model_info = 2;
+Vector vector = 3;

Review Comment:
   looks fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on pull request #39160: [SPARK-41667][K8S] Expose env var SPARK_DRIVER_POD_NAME in Driver Pod

2023-03-13 Thread via GitHub


pan3793 commented on PR #39160:
URL: https://github.com/apache/spark/pull/39160#issuecomment-1465973424

   cross-refer comments from 
https://github.com/apache/spark/pull/40392#issuecomment-1465870752
   
   > Your PR tried to add `SPARK_DRIVER_POD_NAME` to Driver Pod to expose it to 
3rd party pods.
   
   > ... to use Driver Service instead
   
   Is the service name a kind of official API that allows 3rd party components to access the Spark driver in K8s? If yes, what about executors?
   
   My vision is to expose both the driver and the executors to the log service in a unified way, and aggregating logs by Pod is much more straightforward, just like YARN does by container. So my first candidate is the Pod name; the second one is the Pod IP.
   
   @dongjoon-hyun I do understand we should be careful about adding each ENV variable, configuration, etc. If you think the Pod IP is acceptable, then it is sufficient after SPARK-42769.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jdferreira opened a new pull request, #40398: Update `translate` docblock

2023-03-13 Thread via GitHub


jdferreira opened a new pull request, #40398:
URL: https://github.com/apache/spark/pull/40398

   ### What changes were proposed in this pull request?
   The documentation for the `translate` SQL function is a bit difficult to parse and understand. I propose the new text.
   
   
   ### Why are the changes needed?
   To improve documentation
   
   ### Does this PR introduce _any_ user-facing change?
   I'm not sure, and I don't quite understand if I need to do something to make 
this documentation change become visible in the online documentation. I'd 
appreciate help here to improve the PR, if needed.
   
   ### How was this patch tested?
   No tests added or executed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang opened a new pull request, #40395: [SPARK-42770] WIP

2023-03-13 Thread via GitHub


LuciferYang opened a new pull request, #40395:
URL: https://github.com/apache/spark/pull/40395

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-13 Thread via GitHub


mridulm commented on PR #40286:
URL: https://github.com/apache/spark/pull/40286#issuecomment-1465633950

   Merged to master.
   Thanks for fixing this @ivoson !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm closed pull request #40286: [SPARK-42577][CORE] Add max attempts limitation for stages to avoid potential infinite retry

2023-03-13 Thread via GitHub


mridulm closed pull request #40286: [SPARK-42577][CORE] Add max attempts 
limitation for stages to avoid potential infinite retry
URL: https://github.com/apache/spark/pull/40286


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangshengjie123 commented on pull request #40391: [WIP][SPARK-42766][YARN] YarnAllocator filter excluded nodes when launching containers

2023-03-13 Thread via GitHub


wangshengjie123 commented on PR #40391:
URL: https://github.com/apache/spark/pull/40391#issuecomment-1465682073

   I am not sure if we should add an Executor exit code and optimize the RegisterExecutor response message in this PR. In our production environment, we found that only filtering the excluded nodes when launching containers sometimes does not work as well as we want: the driver may not request new executors for a period of time, so the excluded-node list won't be sent to `YarnAllocator` through `requestTotalExecutorsWithPreferredLocalities`, and some executors will fail and count toward the failure count.
   
   So we add the Executor exit code and optimize the RegisterExecutor response message.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40355: [SPARK-42604][CONNECT] Implement functions.typedlit

2023-03-13 Thread via GitHub


zhengruifeng commented on code in PR #40355:
URL: https://github.com/apache/spark/pull/40355#discussion_r1133568797


##
connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:
##
@@ -2065,6 +2065,44 @@ class PlanGenerationTestSuite
   fn.lit(Array(new CalendarInterval(2, 20, 100L), new CalendarInterval(2, 
21, 200L
   }
 
+  test("function typedLit") {
+simple.select(
+  fn.typedLit(fn.col("id")),
+  fn.typedLit('id),
+  fn.typedLit(1),
+  fn.typedLit[String](null),
+  fn.typedLit(true),
+  fn.typedLit(68.toByte),
+  fn.typedLit(9872.toShort),
+  fn.typedLit(-8726532),
+  fn.typedLit(7834609328726532L),
+  fn.typedLit(Math.E),
+  fn.typedLit(-0.8f),
+  fn.typedLit(BigDecimal(8997620, 5)),
+  fn.typedLit(BigDecimal(898897667231L, 7).bigDecimal),
+  fn.typedLit("connect!"),
+  fn.typedLit('T'),
+  fn.typedLit(Array.tabulate(10)(i => ('A' + i).toChar)),
+  fn.typedLit(Array.tabulate(23)(i => (i + 120).toByte)),
+  fn.typedLit(mutable.WrappedArray.make(Array[Byte](8.toByte, 6.toByte))),
+  fn.typedLit(null),
+  fn.typedLit(java.time.LocalDate.of(2020, 10, 10)),
+  fn.typedLit(Decimal.apply(BigDecimal(8997620, 6))),
+  fn.typedLit(java.time.Instant.ofEpochMilli(1677155519808L)),
+  fn.typedLit(new java.sql.Timestamp(12345L)),
+  fn.typedLit(java.time.LocalDateTime.of(2023, 2, 23, 20, 36)),
+  fn.typedLit(java.sql.Date.valueOf("2023-02-23")),
+  fn.typedLit(java.time.Duration.ofSeconds(200L)),
+  fn.typedLit(java.time.Period.ofDays(100)),
+  fn.typedLit(new CalendarInterval(2, 20, 100L)),
+
+  // Handle parameterized scala types e.g.: List, Seq and Map.
+  fn.typedLit(Some(1)),
+  fn.typedLit(Seq(1, 2, 3)),
+  fn.typedLit(Map("a" -> 1, "b" -> 2)),
+  fn.typedLit(("a", 2, 1.0)))

Review Comment:
   what about adding a few nested cases?
   e.g.  array; array; map>
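
   For example, nested cases along these lines could be added (illustrative, using the classic `functions` API; not necessarily the exact types meant above):

   ```scala
   import org.apache.spark.sql.{functions => fn}

   // Each call yields an org.apache.spark.sql.Column holding a nested literal.
   val nestedLits = Seq(
     fn.typedLit(Seq(Seq(1, 2), Seq(3))),              // array<array<int>>
     fn.typedLit(Seq(Map("a" -> 1), Map("b" -> 2))),   // array<map<string,int>>
     fn.typedLit(Map("a" -> Seq(1, 2), "b" -> Seq(3))) // map<string,array<int>>
   )
   ```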



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] WeichenXu123 commented on a diff in pull request #40297: [SPARK-42412][WIP] Initial PR of Spark connect ML

2023-03-13 Thread via GitHub


WeichenXu123 commented on code in PR #40297:
URL: https://github.com/apache/spark/pull/40297#discussion_r1133666221


##
mllib/src/main/scala/org/apache/spark/ml/param/params.scala:
##
@@ -44,8 +45,14 @@ import org.apache.spark.ml.util.Identifiable
  *See [[ParamValidators]] for factory methods for common 
validation functions.
  * @tparam T param value type
  */
-class Param[T](val parent: String, val name: String, val doc: String, val 
isValid: T => Boolean)
-  extends Serializable {
+class Param[T: ClassTag](

Review Comment:
   @zhengruifeng 
   
   I recall I tried this approach `(implicit paramValueClassTag: ClassTag[T])` before, but it makes it hard for us to get the classTag object, so I prefer the current approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] pan3793 commented on pull request #40392: [SPARK-42769][K8S] Add `SPARK_DRIVER_POD_IP` env variable to executor pods

2023-03-13 Thread via GitHub


pan3793 commented on PR #40392:
URL: https://github.com/apache/spark/pull/40392#issuecomment-1465812124

   > ... for some executor pods to connect driver pods via IP.
   
   Hi @dongjoon-hyun, I think it's quite useful, but in 
https://github.com/apache/spark/pull/39160#pullrequestreview-1229691638, you 
left a concern
   
   > ... I have a concern. Currently, Apache Spark uses K8s Service entity via 
[DriverServiceFeatureStep](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala)
 to access Spark driver pod in K8s environment.
   
   do you still have that concern now?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` test pass on Linux

2023-03-13 Thread via GitHub


LuciferYang commented on PR #40395:
URL: https://github.com/apache/spark/pull/40395#issuecomment-1465860908

   https://github.com/apache/spark/actions/runs/4318647315/jobs/7537203682
   
   ```
   [info] - test implicit encoder resolution *** FAILED *** (1 second, 329 
milliseconds)
   4429[info]   2023-03-02T23:00:20.404434 did not equal 
2023-03-02T23:00:20.404434875 (SQLImplicitsTestSuite.scala:63)
   4430[info]   org.scalatest.exceptions.TestFailedException:
   4431[info]   at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
   4432[info]   at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
   4433[info]   at 
org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
   4434[info]   at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
   4435[info]   at 
org.apache.spark.sql.SQLImplicitsTestSuite.testImplicit$1(SQLImplicitsTestSuite.scala:63)
   4436[info]   at 
org.apache.spark.sql.SQLImplicitsTestSuite.$anonfun$new$2(SQLImplicitsTestSuite.scala:133)
   4437[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   4438[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   4439[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   4440[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   4441[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
   4442[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
   4443[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
   [info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
   4445[info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
   4446[info]   at 
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
   4447[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
   4448[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   4449[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   4450[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
   4451[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
   4452[info]   at 
org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
   4453[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
   4454[info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
   4455[info]   at scala.collection.immutable.List.foreach(List.scala:431)
   4456[info]   at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   4457[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
   4458[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
   4459[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
   4460[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
   4461[info]   at 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
   4462[info]   at org.scalatest.Suite.run(Suite.scala:1114)
   4463[info]   at org.scalatest.Suite.run$(Suite.scala:1096)
   4464[info]   at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
   4465[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
   4466[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
   4467[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
   4468[info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
   4469[info]   at 
org.apache.spark.sql.SQLImplicitsTestSuite.org$scalatest$BeforeAndAfterAll$$super$run(SQLImplicitsTestSuite.scala:34)
   4470[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
   4471[info]   at 
org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
   4472[info]   at 
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
   4473[info]   at 
org.apache.spark.sql.SQLImplicitsTestSuite.run(SQLImplicitsTestSuite.scala:34)
   4474[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
   4475[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
   4476[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
   4477[info]   at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
   4478[info]   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
   4479[info]   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
   4480[info]   at java.base/java.lang.Thread.run(Thread.java:833) 
   ```
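
   A minimal sketch of the truncation the PR title refers to (my reading of the fix, not the PR's exact code): Catalyst timestamps carry microsecond precision, so nanosecond-precision values produced by `java.time` on Linux need to be truncated before being compared with round-tripped results.

   ```scala
   import java.time.LocalDateTime
   import java.time.temporal.ChronoUnit

   val now = LocalDateTime.now()                      // may carry nanosecond precision on Linux
   val expected = now.truncatedTo(ChronoUnit.MICROS)  // truncate to microsecond precision before comparing

   assert(expected.getNano % 1000 == 0)               // no sub-microsecond component remains
   ```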


[GitHub] [spark] dongjoon-hyun commented on pull request #40392: [SPARK-42769][K8S] Add `SPARK_DRIVER_POD_IP` env variable to executor pods

2023-03-13 Thread via GitHub


dongjoon-hyun commented on PR #40392:
URL: https://github.com/apache/spark/pull/40392#issuecomment-1465870752

   @pan3793 .
   
   The goal of PR is different from your PR's goal.
   - Your PR tried to add `SPARK_DRIVER_POD_NAME` to `Driver Pod` to expose it 
to **3rd party pods**.
   - This PR aims to add `SPARK_DRIVER_POD_IP` to `Executor Pod` in order to 
help **internal communications between Spark executors and Spark driver**.
   
   In addition, this is a kind of propagation of the information from the 
driver pod to the executor pods instead of exposing the executor pods' internal 
information.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40395: [SPARK-42770][CONNECT] Add `truncatedTo(ChronoUnit.MICROS)` to make `SQLImplicitsTestSuite` in Java 17 daily test GA task pass

2023-03-13 Thread via GitHub


LuciferYang commented on PR #40395:
URL: https://github.com/apache/spark/pull/40395#issuecomment-1465912943

   cc @HyukjinKwon 
   also cc @bjornjorgensen, who reported this issue on the dev mailing list
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer opened a new pull request, #40396: [SPARK-42772][SQL] Change the default value of JDBC options about push down to true

2023-03-13 Thread via GitHub


beliefer opened a new pull request, #40396:
URL: https://github.com/apache/spark/pull/40396

   ### What changes were proposed in this pull request?
   Currently, DS V2 pushdown lets the JDBC dialect decide whether to push down `OFFSET`, `LIMIT` and table sample. Because some databases don't support all of them, we should change the default value of these pushdown APIs to false; if a database supports the syntax, its JDBC dialect should override the value.
   
   We also have a number of JDBC options about push down, such as `pushDownOffset`. Users can change the option value to allow or disallow push down.
   
   ### Why are the changes needed?
   This PR changes all JDBC V2 pushdown options to true and changes all the dialects' pushdown APIs to false.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'Yes'.
   The default behavior of the pushdown framework is to not push down SQL syntax to the JDBC data source.
   Users can enable or disable pushdown with the JDBC push-down options, such as `pushDownOffset`.
   
   
   ### How was this patch tested?
   Test cases updated.
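
   For reference, a usage sketch of the push-down options mentioned above (connection details are placeholders; behavior as described in this PR, not verified against the final merge):

   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder().master("local[*]").getOrCreate()

   // With the defaults proposed here the options are true; a user can still opt out per read.
   val df = spark.read
     .format("jdbc")
     .option("url", "jdbc:postgresql://host:5432/db")  // placeholder URL
     .option("dbtable", "some_table")                  // placeholder table
     .option("pushDownLimit", "true")
     .option("pushDownOffset", "true")
     .option("pushDownTableSample", "false")           // explicitly disable one kind of pushdown
     .load()
   ```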
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun opened a new pull request, #40397: [SPARK-42052][SQL] Codegen Support for HiveSimpleUDF

2023-03-13 Thread via GitHub


panbingkun opened a new pull request, #40397:
URL: https://github.com/apache/spark/pull/40397

   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Add new UT.
   Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun commented on a diff in pull request #39949: [SPARK-42386][SQL] Rewrite HiveGenericUDF with Invoke

2023-03-13 Thread via GitHub


panbingkun commented on code in PR #39949:
URL: https://github.com/apache/spark/pull/39949#discussion_r1133823856


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala:
##
@@ -194,47 +183,52 @@ private[hive] case class HiveGenericUDF(
 
   override protected def withNewChildrenInternal(newChildren: 
IndexedSeq[Expression]): Expression =
 copy(children = newChildren)
-  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
-val refTerm = ctx.addReferenceObj("this", this)
-val childrenEvals = children.map(_.genCode(ctx))
-
-val setDeferredObjects = childrenEvals.zipWithIndex.map {

Review Comment:
   Following it, codegen for HiveSimpleUDF is implemented in: https://github.com/apache/spark/pull/40397



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] EnricoMi commented on pull request #39952: [SPARK-40770][PYTHON][FOLLOW-UP] Improved error messages for mapInPandas for schema mismatch

2023-03-13 Thread via GitHub


EnricoMi commented on PR #39952:
URL: https://github.com/apache/spark/pull/39952#issuecomment-1465739961

   CC @cloud-fan @itholic @zhengruifeng 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   >