Re: [PR] [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46459:
URL: https://github.com/apache/spark/pull/46459#issuecomment-2099780272

   Merged to branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46459: [SPARK-48037][CORE][3.5] Fix 
SortShuffleWriter lacks shuffle write related metrics resulting in potentially 
inaccurate data
URL: https://github.com/apache/spark/pull/46459





Re: [PR] [SPARK-47777][PYTHON][SS][TESTS] Add spark connect test for python streaming data source [spark]

2024-05-07 Thread via GitHub


chaoqin-li1123 commented on PR #45950:
URL: https://github.com/apache/spark/pull/45950#issuecomment-2099774647

   This seems to be broken in the main function of pyspark init(); what is the 
expected action item we should take? @HyukjinKwon 





Re: [PR] [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46459:
URL: https://github.com/apache/spark/pull/46459#issuecomment-2099760350

   Could you make a backporting PR to branch-3.4 too, @cxzl25 ?





Re: [PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-07 Thread via GitHub


LuciferYang commented on code in PR #46457:
URL: https://github.com/apache/spark/pull/46457#discussion_r1593383893


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##
@@ -197,8 +197,8 @@ trait SparkDateTimeUtils {
 rebaseJulianToGregorianDays(julianDays)
   }
 
-  private val zoneInfoClassName = "sun.util.calendar.ZoneInfo"
-  private val getOffsetsByWallHandle = {
+  private lazy val zoneInfoClassName = "sun.util.calendar.ZoneInfo"

Review Comment:
   On the other hand, if `TimeZone.getDefault.getOffset(localMillis)` can be 
uniformly used to get `timeZoneOffset` in the `toJavaDate` function, then the 
use of the `MethodHandle` mechanism can be avoided here.
   






Re: [PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-07 Thread via GitHub


LuciferYang commented on code in PR #46457:
URL: https://github.com/apache/spark/pull/46457#discussion_r1593382052


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/SparkDateTimeUtils.scala:
##
@@ -197,8 +197,8 @@ trait SparkDateTimeUtils {
 rebaseJulianToGregorianDays(julianDays)
   }
 
-  private val zoneInfoClassName = "sun.util.calendar.ZoneInfo"
-  private val getOffsetsByWallHandle = {
+  private lazy val zoneInfoClassName = "sun.util.calendar.ZoneInfo"

Review Comment:
   Is it necessary to use `lazy` for this since `zoneInfoClassName` is just a 
`String`? `lazy` comes with the cost of locking.
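
   The locking cost mentioned here can be illustrated with a small Python analog of how a Scala 2 `lazy val` is typically compiled (double-checked initialization under a lock). The `LazyBox` class below is purely illustrative, not Spark code:

```python
import threading

class LazyBox:
    """Rough Python analog of a Scala `lazy val`: the first access computes
    the value under a lock; later reads still pay at least a readiness check."""

    def __init__(self, compute):
        self._compute = compute
        self._lock = threading.Lock()
        self._value = None
        self._ready = False

    def get(self):
        if not self._ready:            # fast path once initialized
            with self._lock:           # the locking cost the comment refers to
                if not self._ready:    # re-check after acquiring the lock
                    self._value = self._compute()
                    self._ready = True
        return self._value

box = LazyBox(lambda: "sun.util.calendar.ZoneInfo")
assert box.get() == "sun.util.calendar.ZoneInfo"
```

   For a constant `String` there is nothing expensive to defer, which is why the reviewer questions paying this synchronization overhead.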






[PR] [WIP][SQL] Add collation support for JSON expressions [spark]

2024-05-07 Thread via GitHub


uros-db opened a new pull request, #46462:
URL: https://github.com/apache/spark/pull/46462

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on code in PR #46456:
URL: https://github.com/apache/spark/pull/46456#discussion_r1593374153


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -430,6 +430,11 @@ def test_sample(self):
 IllegalArgumentException, lambda: 
self.spark.range(1).sample(-1.0).count()
 )
 
+def test_sample_with_random_seed(self):
+df = self.spark.range(1).sample(0.1)
+cnts = [df.count() for i in range(10)]
+self.assertEqual(1, len(set(cnts)))

Review Comment:
   Oh, it seems that we have a unit test.
   
   ```
   FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
       self.assertEqual(plan.root.sample.HasField("seed"), False)
   AssertionError: True != False
   ```



##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -430,6 +430,11 @@ def test_sample(self):
 IllegalArgumentException, lambda: 
self.spark.range(1).sample(-1.0).count()
 )
 
+def test_sample_with_random_seed(self):
+df = self.spark.range(1).sample(0.1)
+cnts = [df.count() for i in range(10)]
+self.assertEqual(1, len(set(cnts)))

Review Comment:
   Oh, it seems that we have a unit test. Could you update it?
   
   ```
   FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
       self.assertEqual(plan.root.sample.HasField("seed"), False)
   AssertionError: True != False
   ```






Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on code in PR #46456:
URL: https://github.com/apache/spark/pull/46456#discussion_r1593374153


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -430,6 +430,11 @@ def test_sample(self):
 IllegalArgumentException, lambda: 
self.spark.range(1).sample(-1.0).count()
 )
 
+def test_sample_with_random_seed(self):
+df = self.spark.range(1).sample(0.1)
+cnts = [df.count() for i in range(10)]
+self.assertEqual(1, len(set(cnts)))

Review Comment:
   Oh, it seems that we have a unit test already. Shall we remove this and 
update that one, @zhengruifeng , in order to avoid increasing test time?
   
   ```
   FAIL [0.001s]: test_sample (pyspark.sql.tests.connect.test_connect_plan.SparkConnectPlanTests.test_sample)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_plan.py", line 446, in test_sample
       self.assertEqual(plan.root.sample.HasField("seed"), False)
   AssertionError: True != False
   ```






[PR] [WIP][SQL] Misc expressions [spark]

2024-05-07 Thread via GitHub


uros-db opened a new pull request, #46461:
URL: https://github.com/apache/spark/pull/46461

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48183][PYTHON][DOCS] Update error contribution guide to respect new error class file [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46455:
URL: https://github.com/apache/spark/pull/46455#issuecomment-2099711660

   Merged to master.





Re: [PR] [SPARK-48183][PYTHON][DOCS] Update error contribution guide to respect new error class file [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46455: [SPARK-48183][PYTHON][DOCS] Update 
error contribution guide to respect new error class file
URL: https://github.com/apache/spark/pull/46455





[PR] [WIP][SQL] Add collation support for URL expressions [spark]

2024-05-07 Thread via GitHub


uros-db opened a new pull request, #46460:
URL: https://github.com/apache/spark/pull/46460

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-48037][CORE][3.5] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-07 Thread via GitHub


cxzl25 opened a new pull request, #46459:
URL: https://github.com/apache/spark/pull/46459

   ### What changes were proposed in this pull request?
   This PR aims to fix the issue that SortShuffleWriter lacks shuffle-write-related 
metrics, resulting in potentially inaccurate data.
   
   ### Why are the changes needed?
   When the shuffle writer is SortShuffleWriter, it does not use 
SQLShuffleWriteMetricsReporter to update metrics, so the runtime statistics 
that AQE obtains report a rowCount of 0.
   
   Some optimization rules, such as `EliminateLimits`, rely on rowCount 
statistics. Because rowCount is 0, the rule removes the limit operator, and the 
query then returns results without the limit applied.
   
   
https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala#L168-L172
   
   
https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2067-L2070
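   
   A minimal sketch of the failure mode described above, with a hypothetical 
`eliminate_limit` predicate standing in for the `EliminateLimits` rule (the 
real rule operates on Catalyst plans and their statistics, not plain integers):

```python
def eliminate_limit(limit: int, child_row_count: int) -> bool:
    """Return True if the optimizer would consider the LIMIT redundant:
    the child is already known to produce no more rows than the limit."""
    return child_row_count <= limit

# With correct shuffle write metrics: 2 rows, LIMIT 1 -> limit is kept.
assert eliminate_limit(1, 2) is False

# With the bug: rowCount is reported as 0 -> the limit is wrongly removed,
# and the query returns all rows instead of at most 1.
assert eliminate_limit(1, 0) is True
```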
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   
   ### How was this patch tested?
   Production environment verification.
   
   **master metrics**
   (screenshot) https://github.com/apache/spark/assets/3898450/dc9b6e8a-93ec-4f59-a903-71aa5b11962c
   
   **PR metrics**
   
   (screenshot) https://github.com/apache/spark/assets/3898450/2d73b773-2dcc-4d23-81de-25dcadac86c1
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   





[PR] [WIP][SQL] Add support for AbstractMapType [spark]

2024-05-07 Thread via GitHub


uros-db opened a new pull request, #46458:
URL: https://github.com/apache/spark/pull/46458

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47914][SQL] Do not display the splits parameter in Range [spark]

2024-05-07 Thread via GitHub


yaooqinn commented on PR #46136:
URL: https://github.com/apache/spark/pull/46136#issuecomment-2099698487

   Merged to master
   
   Thank you @guixiaowen 





Re: [PR] [SPARK-47914][SQL] Do not display the splits parameter in Range [spark]

2024-05-07 Thread via GitHub


yaooqinn closed pull request #46136: [SPARK-47914][SQL] Do not display the 
splits parameter in Range
URL: https://github.com/apache/spark/pull/46136





[PR] [SPARK-48185][SQL] Fix 'symbolic reference class is not accessible: class sun.util.calendar.ZoneInfo' [spark]

2024-05-07 Thread via GitHub


yaooqinn opened a new pull request, #46457:
URL: https://github.com/apache/spark/pull/46457

   
   
   
   ### What changes were proposed in this pull request?
   
   
   I met the error below while debugging UTs because of loading 
`sun.util.calendar.ZoneInfo` eagerly. This PR makes the relevant variables lazy.
   
   ```log
   Caused by: java.lang.IllegalAccessException: symbolic reference class is not 
accessible: class sun.util.calendar.ZoneInfo, from interface 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils (unnamed module @65d6b83b)
at 
java.base/java.lang.invoke.MemberName.makeAccessException(MemberName.java:955)
at 
java.base/java.lang.invoke.MethodHandles$Lookup.checkSymbolicClass(MethodHandles.java:3686)
at 
java.base/java.lang.invoke.MethodHandles$Lookup.resolveOrFail(MethodHandles.java:3646)
at 
java.base/java.lang.invoke.MethodHandles$Lookup.findVirtual(MethodHandles.java:2680)
at 
org.apache.spark.sql.catalyst.util.SparkDateTimeUtils.$init$(SparkDateTimeUtils.scala:206)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.(DateTimeUtils.scala:41)
... 82 more
   ```
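   
   As a rough Python analog of the fix (the actual change makes Scala `val`s 
`lazy`), deferring the reflective lookup moves the failure from 
class-initialization time to first use. The names below are illustrative only:

```python
import functools

class DateTimeUtilsAnalog:
    """Stand-in for SparkDateTimeUtils: the handle lookup may fail,
    but with lazy caching it no longer fails at construction time."""

    @functools.cached_property
    def zone_info_handle(self):
        # stand-in for the MethodHandle lookup that can be inaccessible
        raise RuntimeError("symbolic reference class is not accessible")

utils = DateTimeUtilsAnalog()   # construction succeeds: lookup is deferred
deferred_error = None
try:
    utils.zone_info_handle      # the failure surfaces only on first access
except RuntimeError as e:
    deferred_error = str(e)
```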
   
   ### Why are the changes needed?
   
   sun.util.calendar.ZoneInfo is inaccessible in some scenarios. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, such errors might be delayed from backend-scheduling to job-scheduling
   
   ### How was this patch tested?
   
   I tested with idea and UT debugging locally
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





Re: [PR] [SPARK-42093][SQL] Move JavaTypeInference to AgnosticEncoders [spark]

2024-05-07 Thread via GitHub


viirya commented on code in PR #39615:
URL: https://github.com/apache/spark/pull/39615#discussion_r1593353161


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:
##
@@ -166,317 +148,58 @@ object JavaTypeInference {
   .filter(_.getReadMethod != null)
   }
 
-  private def getJavaBeanReadableAndWritableProperties(
-  beanClass: Class[_]): Array[PropertyDescriptor] = {
-getJavaBeanReadableProperties(beanClass).filter(_.getWriteMethod != null)
-  }
-
-  private def elementType(typeToken: TypeToken[_]): TypeToken[_] = {
-val typeToken2 = typeToken.asInstanceOf[TypeToken[_ <: JIterable[_]]]
-val iterableSuperType = typeToken2.getSupertype(classOf[JIterable[_]])
-val iteratorType = iterableSuperType.resolveType(iteratorReturnType)
-iteratorType.resolveType(nextReturnType)
-  }
-
-  private def mapKeyValueType(typeToken: TypeToken[_]): (TypeToken[_], 
TypeToken[_]) = {
-val typeToken2 = typeToken.asInstanceOf[TypeToken[_ <: JMap[_, _]]]
-val mapSuperType = typeToken2.getSupertype(classOf[JMap[_, _]])
-val keyType = elementType(mapSuperType.resolveType(keySetReturnType))
-val valueType = elementType(mapSuperType.resolveType(valuesReturnType))
-keyType -> valueType
-  }
-
-  /**
-   * Returns the Spark SQL DataType for a given java class.  Where this is not 
an exact mapping
-   * to a native type, an ObjectType is returned.
-   *
-   * Unlike `inferDataType`, this function doesn't do any massaging of types 
into the Spark SQL type
-   * system.  As a result, ObjectType will be returned for things like boxed 
Integers.
-   */
-  private def inferExternalType(cls: Class[_]): DataType = cls match {
-case c if c == java.lang.Boolean.TYPE => BooleanType
-case c if c == java.lang.Byte.TYPE => ByteType
-case c if c == java.lang.Short.TYPE => ShortType
-case c if c == java.lang.Integer.TYPE => IntegerType
-case c if c == java.lang.Long.TYPE => LongType
-case c if c == java.lang.Float.TYPE => FloatType
-case c if c == java.lang.Double.TYPE => DoubleType
-case c if c == classOf[Array[Byte]] => BinaryType
-case _ => ObjectType(cls)
-  }
-
-  /**
-   * Returns an expression that can be used to deserialize a Spark SQL 
representation to an object
-   * of java bean `T` with a compatible schema.  The Spark SQL representation 
is located at ordinal
-   * 0 of a row, i.e., `GetColumnByOrdinal(0, _)`. Nested classes will have 
their fields accessed
-   * using `UnresolvedExtractValue`.
-   */
-  def deserializerFor(beanClass: Class[_]): Expression = {
-val typeToken = TypeToken.of(beanClass)
-val walkedTypePath = new 
WalkedTypePath().recordRoot(beanClass.getCanonicalName)
-val (dataType, nullable) = inferDataType(typeToken)
-
-// Assumes we are deserializing the first column of a row.
-deserializerForWithNullSafetyAndUpcast(GetColumnByOrdinal(0, dataType), 
dataType,
-  nullable = nullable, walkedTypePath, deserializerFor(typeToken, _, 
walkedTypePath))
-  }
-
-  private def deserializerFor(
-  typeToken: TypeToken[_],
-  path: Expression,
-  walkedTypePath: WalkedTypePath): Expression = {
-typeToken.getRawType match {
-  case c if !inferExternalType(c).isInstanceOf[ObjectType] => path
-
-  case c if c == classOf[java.lang.Short] ||
-c == classOf[java.lang.Integer] ||
-c == classOf[java.lang.Long] ||
-c == classOf[java.lang.Double] ||
-c == classOf[java.lang.Float] ||
-c == classOf[java.lang.Byte] ||
-c == classOf[java.lang.Boolean] =>
-createDeserializerForTypesSupportValueOf(path, c)
-
-  case c if c == classOf[java.time.LocalDate] =>
-createDeserializerForLocalDate(path)
-
-  case c if c == classOf[java.sql.Date] =>
-createDeserializerForSqlDate(path)
-
-  case c if c == classOf[java.time.Instant] =>
-createDeserializerForInstant(path)
-
-  case c if c == classOf[java.sql.Timestamp] =>
-createDeserializerForSqlTimestamp(path)
+  private class ImplementsGenericInterface(interface: Class[_]) {
+assert(interface.isInterface)
+assert(interface.getTypeParameters.nonEmpty)
 
-  case c if c == classOf[java.time.LocalDateTime] =>
-createDeserializerForLocalDateTime(path)
-
-  case c if c == classOf[java.time.Duration] =>
-createDeserializerForDuration(path)
-
-  case c if c == classOf[java.time.Period] =>
-createDeserializerForPeriod(path)
-
-  case c if c == classOf[java.lang.String] =>
-createDeserializerForString(path, returnNullable = true)
-
-  case c if c == classOf[java.math.BigDecimal] =>
-createDeserializerForJavaBigDecimal(path, returnNullable = true)
-
-  case c if c == classOf[java.math.BigInteger] =>
-createDeserializerForJavaBigInteger(path, returnNullable = true)
-
-  case c if c.isArray =>
-val 

Re: [PR] [SPARK-48037][CORE] Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data [spark]

2024-05-07 Thread via GitHub


cxzl25 commented on code in PR #46273:
URL: https://github.com/apache/spark/pull/46273#discussion_r1593350763


##
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala:
##
@@ -2502,6 +2502,26 @@ class AdaptiveQueryExecSuite
 }
   }
 
+  test("SPARK-48037: Fix SortShuffleWriter lacks shuffle write related metrics 
" +
+"resulting in potentially inaccurate data") {
+withTable("t3") {
+  withSQLConf(
+SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
+SQLConf.SHUFFLE_PARTITIONS.key -> "16777217") {
+sql("CREATE TABLE t3 USING PARQUET AS SELECT id FROM range(2)")
+val (plan, adaptivePlan) = runAdaptiveAndVerifyResult(
+  """
+|SELECT id, count(*)
+|FROM t3
+|GROUP BY id
+|LIMIT 1
+|""".stripMargin, skipCheckAnswer = true)

Review Comment:
   In this test case, because `spark.sql.adaptive.enabled=false`, partitions 
will not be merged. 
   It has a large number of partitions and tasks, so it requires a large amount 
of driver memory to execute successfully.






Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46440:
URL: https://github.com/apache/spark/pull/46440#issuecomment-2099685242

   No problem, @yaooqinn. We have enough time for 4.0.0. :) 





Re: [PR] [SPARK-48168][SQL] Add bitwise shifting operators support [spark]

2024-05-07 Thread via GitHub


yaooqinn commented on PR #46440:
URL: https://github.com/apache/spark/pull/46440#issuecomment-2099683026

   Thank you @dongjoon-hyun 
   Let me look into this issue; it might take a while.





Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46456:
URL: https://github.com/apache/spark/pull/46456#issuecomment-2099681039

   Ah, I got your point. It's a very interesting `connector` bug.
   > df2 should be immutable.
   
   I was thinking the following. My bad.
   ```
   scala> spark.range(1).sample(0.1).count()
   res0: Long = 1080
   
   scala> spark.range(1).sample(0.1).count()
   res1: Long = 998
   ```





Re: [PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-07 Thread via GitHub


zhengruifeng commented on PR #46456:
URL: https://github.com/apache/spark/pull/46456#issuecomment-2099676615

   @dongjoon-hyun With `df2 = df1.sample()`, the dataframe `df2` should be immutable.
   I think it is a bug and we should backport it; I have updated the affected 
versions in JIRA.





[PR] [SPARK-48184][PYTHON][CONNECT] Always set the seed of `Dataframe.sample` in Client side [spark]

2024-05-07 Thread via GitHub


zhengruifeng opened a new pull request, #46456:
URL: https://github.com/apache/spark/pull/46456

   ### What changes were proposed in this pull request?
   Always set the seed of `Dataframe.sample` in Client side
   
   
   ### Why are the changes needed?
   Bug fix
   
   If the seed is not set on the client, it is set on the server side with a 
random int
   
   
https://github.com/apache/spark/blob/c4df12cc884cddefcfcf8324b4d7b9349fb4f6a0/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala#L386
   
which causes inconsistent results across multiple executions
   
   In Spark Classic:
   ```
   In [1]: df = spark.range(1).sample(0.1)
   
   In [2]: [df.count() for i in range(10)]
   Out[2]: [1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006, 1006]
   ```
   
   In Spark Connect:
   
   before:
   ```
   In [1]: df = spark.range(1).sample(0.1)
   
   In [2]: [df.count() for i in range(10)]
   Out[2]: [969, 1005, 958, 996, 987, 1026, 991, 1020, 1012, 979]
   ```
   
   after:
   ```
   In [1]: df = spark.range(1).sample(0.1)
   
   In [2]: [df.count() for i in range(10)]
   Out[2]: [1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032, 1032]
   ```
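   
   The behavior above can be reproduced with plain Python sampling, independent 
of Spark: reusing a client-chosen seed gives stable counts across executions, 
while a fresh server-chosen seed per execution does not. The function name 
below is illustrative, not a Spark API:

```python
import random

def sample_count(n: int, fraction: float, seed: int) -> int:
    """Count how many of n items a Bernoulli sample with this seed keeps."""
    rng = random.Random(seed)
    return sum(rng.random() < fraction for _ in range(n))

# Seed chosen once "client side" and reused: every execution agrees.
fixed = [sample_count(10_000, 0.1, seed=42) for _ in range(10)]
assert len(set(fixed)) == 1

# Fresh random seed per execution (the buggy server-side behavior):
fresh = [sample_count(10_000, 0.1, seed=random.randrange(2**31))
         for _ in range(10)]
# the counts now almost certainly disagree between executions
```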
   
   ### Does this PR introduce _any_ user-facing change?
   yes, bug fix
   
   
   ### How was this patch tested?
   ci
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no





Re: [PR] [MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46453:
URL: https://github.com/apache/spark/pull/46453#issuecomment-2099655997

   Merged to master.
   
   Feel free to backport this if you need in branch-3.5.





Re: [PR] [MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46453: [MINOR][PYTHON][TESTS] Remove the doc 
in error message tests to allow other PyArrow versions in tests
URL: https://github.com/apache/spark/pull/46453





Re: [PR] [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on code in PR #46451:
URL: https://github.com/apache/spark/pull/46451#discussion_r1593312100


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##
@@ -1648,6 +1648,20 @@ class AstBuilder extends DataTypeAstBuilder with SQLConfHelper with Logging {
          message = "ORDER BY cannot be specified unless either " +
            "PARTITION BY or WITH SINGLE PARTITION is also present",
          ctx = ctx.tableArgumentPartitioning)
+    def invalidPartitionOrOrderingExpression(clause: String): String = {
+      s"The table function call includes a table argument with an invalid partitioning/ordering " +
+        s"specification: the $clause clause included multiple expressions without parentheses " +
+        s"surrounding them; please add parentheses around these expressions and then retry the " +
+        s"query again"

Review Comment:
   ```suggestion
      "The table function call includes a table argument with an invalid partitioning/ordering " +
        s"specification: the $clause clause included multiple expressions without parentheses " +
        "surrounding them; please add parentheses around these expressions and then retry the " +
        "query again"
   ```









[PR] [SPARK-48183][PYTHON][DOCS] Update error contribution guide to respect new error class file [spark]

2024-05-07 Thread via GitHub


itholic opened a new pull request, #46455:
URL: https://github.com/apache/spark/pull/46455

   ### What changes were proposed in this pull request?
   
   This PR proposes to update error contribution guide to respect new error 
class file.
   
   ### Why are the changes needed?
   
   We recently moved the error class definition from `error_classes.py` to 
`error-conditions.json`, but some of documentation still describes the old file.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but the user-facing documentation will be updated.
   
   ### How was this patch tested?
   
   Passed `dev/lint-python`, and also the existing CI should pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46454:
URL: https://github.com/apache/spark/pull/46454#issuecomment-2099625781

   Thank you!





Re: [PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


HyukjinKwon closed pull request #46454: [SPARK-48149][INFRA][FOLLOWUP] Use 
single quotation mark
URL: https://github.com/apache/spark/pull/46454





Re: [PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #46454:
URL: https://github.com/apache/spark/pull/46454#issuecomment-2099623389

   Merged to master.





Re: [PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46454:
URL: https://github.com/apache/spark/pull/46454#issuecomment-2099617650

   Could you review this fix, @HyukjinKwon ?





Re: [PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on code in PR #46454:
URL: https://github.com/apache/spark/pull/46454#discussion_r1593303069


##
.github/workflows/build_python.yml:
##
@@ -34,9 +34,9 @@ jobs:
   fail-fast: false
   matrix:
 include:
-  - pyversion: ${{ github.event.schedule == '0 15 * * *' && "pypy3" }}
-  - pyversion: ${{ github.event.schedule == '0 17 * * *' && "python3.10" }}
-  - pyversion: ${{ github.event.schedule == '0 19 * * *' && "python3.12" }}
+  - pyversion: ${{ github.event.schedule == '0 15 * * *' && 'pypy3' }}
+  - pyversion: ${{ github.event.schedule == '0 17 * * *' && 'python3.10' }}
+  - pyversion: ${{ github.event.schedule == '0 19 * * *' && 'python3.12' }}

Review Comment:
   Single quotation mark will work like the following.
   
   
https://github.com/apache/spark/blob/84c5b919d99872858d2f98db21fd3482f27dcbfc/.github/workflows/build_and_test.yml#L202






[PR] [SPARK-48149][INFRA][FOLLOWUP] Use single quotation mark [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun opened a new pull request, #46454:
URL: https://github.com/apache/spark/pull/46454

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46446:
URL: https://github.com/apache/spark/pull/46446#issuecomment-2099613832

   Thanks a lot, @dongjoon-hyun !





Re: [PR] [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46446:
URL: https://github.com/apache/spark/pull/46446#issuecomment-2099610305

   Merged to master. Thank you, @gengliangwang !





Re: [PR] [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46446: [SPARK-48131][CORE][FOLLOWUP] Add a 
new configuration for the MDC key of Task Name
URL: https://github.com/apache/spark/pull/46446





Re: [PR] [SPARK-48131][CORE][FOLLOWUP] Add a new configuration for the MDC key of Task Name [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46446:
URL: https://github.com/apache/spark/pull/46446#issuecomment-2099609933

   To @mridulm, let me merge this first because @cloud-fan will cut the branch today.
   We can change the default value later before 4.0.0.





Re: [PR] Make some corrections in the docstring of pyspark DataStreamReader methods [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #46416:
URL: https://github.com/apache/spark/pull/46416#issuecomment-2099607939

   Thanks for the fix. Mind filing a JIRA please? See also 
https://spark.apache.org/contributing.html





Re: [PR] [DO-NOT-REVIEW][SPARK-48089][SS][CONNECT] Fix Streaming 3.5 <> 4.0 compatibility test [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #46444:
URL: https://github.com/apache/spark/pull/46444#issuecomment-2099606539

   https://github.com/WweiL/oss-spark/actions/runs/8995215790





Re: [PR] [SPARK-48126][Core] Make `spark.log.structuredLogging.enabled` effective [spark]

2024-05-07 Thread via GitHub


gengliangwang closed pull request #46452: [SPARK-48126][Core] Make 
`spark.log.structuredLogging.enabled` effective
URL: https://github.com/apache/spark/pull/46452





Re: [PR] [SPARK-48126][Core] Make `spark.log.structuredLogging.enabled` effective [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46452:
URL: https://github.com/apache/spark/pull/46452#issuecomment-2099603466

   @dongjoon-hyun @HyukjinKwon Thanks for the review. I am merging this one to 
master





[PR] [MINOR][PYTHON][TESTS] Remove the doc in error message tests to allow other PyArrow versions in tests [spark]

2024-05-07 Thread via GitHub


HyukjinKwon opened a new pull request, #46453:
URL: https://github.com/apache/spark/pull/46453

   ### What changes were proposed in this pull request?
   
   This PR is a minor change to support more PyArrow versions in the test.
   
   ### Why are the changes needed?
   
   To support more PyArrow versions in the test; otherwise it can fail 
(https://github.com/HyukjinKwon/spark/actions/runs/8994639538/job/24708397027):
   
   ```
   Traceback (most recent call last):
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py", line 585, in _test_merge_error
       self.__test_merge_error(
     File "/home/runner/work/spark/spark-3.5/python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py", line 606, in __test_merge_error
       with self.assertRaisesRegex(error_class, error_message_regex):
   AssertionError: "Return type of the user-defined function should be pandas.DataFrame, but is int64." does not match "
     An exception was thrown from the Python worker. Please see the stack trace below.
   Traceback (most recent call last):
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1834, in main
       process()
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1826, in process
       serializer.dump_stream(out_iter, outfile)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 531, in dump_stream
       return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 104, in dump_stream
       for batch in iterator:
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 524, in init_stream_yield_batches
       for series in iterator:
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 1694, in mapper
       return f(df1_keys, df1_vals, df2_keys, df2_vals)
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 370, in <lambda>
       return lambda kl, vl, kr, vr: [(wrapped(kl, vl, kr, vr), to_arrow_type(return_type))]
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in wrapped
       verify_pandas_result(
     File "/home/runner/work/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 234, in verify_pandas_result
       raise PySparkTypeError(
   pyspark.errors.exceptions.base.PySparkTypeError: [UDF_RETURN_TYPE] Return type of the user-defined function should be pandas.DataFrame, but is int.
   ```
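The failure comes from pinning the version-specific tail of the error message. A minimal plain-Python sketch (the two message strings are taken from the traceback above; the pattern name is illustrative) shows the idea of asserting only the stable prefix so the test passes under either PyArrow version:

```python
import re

# Messages seen under two different versions: the reported type name differs.
old_msg = ("Return type of the user-defined function should be "
           "pandas.DataFrame, but is int64.")
new_msg = ("Return type of the user-defined function should be "
           "pandas.DataFrame, but is int.")

# Matching only the stable prefix (rather than the full message including
# "int64.") succeeds for both variants.
stable_pattern = r"Return type of the user-defined function should be pandas\.DataFrame"

matches_both = all(re.search(stable_pattern, m) for m in (old_msg, new_msg))
```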
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   CI should validate it.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [WIP][SPARK-24815] [CORE] Trigger Interval based DRA for Structured Streaming [spark]

2024-05-07 Thread via GitHub


pkotikalapudi commented on code in PR #42352:
URL: https://github.com/apache/spark/pull/42352#discussion_r1593285827


##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -340,6 +385,45 @@ private[spark] class ExecutorAllocationManager(
 }
   }
 
+  /**
+   * Maximum number of executors to be removed per DRA evaluation.

Review Comment:
   Thanks for looking into this potential feature.
   
   Sounds fair. Will limit these configs just for streaming use-case. 
   
   >I do think it should be a separate pr and discussion that adds it to batch 
though.
   
   Sure. Maybe in future there are people requesting it for batch, we can 
change this.






Re: [PR] [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo [spark]

2024-05-07 Thread via GitHub


panbingkun commented on PR #46402:
URL: https://github.com/apache/spark/pull/46402#issuecomment-2099547060

   > Let's bring this into 4.0.0-preview and get more feedback. We can revise 
later before 4.0.0.
   
   Thanks. ❤️





Re: [PR] [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False [spark]

2024-05-07 Thread via GitHub


HyukjinKwon closed pull request #46391: [SPARK-48045][PYTHON] Pandas API 
groupby with multi-agg-relabel ignores as_index=False
URL: https://github.com/apache/spark/pull/46391





Re: [PR] [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #46391:
URL: https://github.com/apache/spark/pull/46391#issuecomment-2099529188

   Merged to master





Re: [PR] [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework [spark]

2024-05-07 Thread via GitHub


gengliangwang closed pull request #46362: [SPARK-47240][CORE][PART1] Migrate 
logInfo with variables to structured logging framework
URL: https://github.com/apache/spark/pull/46362





Re: [PR] [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46362:
URL: https://github.com/apache/spark/pull/46362#issuecomment-2099522988

   Thanks, merging to master





Re: [PR] [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46402:
URL: https://github.com/apache/spark/pull/46402#issuecomment-2099515307

   Merged to master.





Re: [PR] [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46402: [SPARK-48152][BUILD] Make 
`spark-profiler` as a part of release and publish to maven central repo
URL: https://github.com/apache/spark/pull/46402





Re: [PR] [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False [spark]

2024-05-07 Thread via GitHub


itholic commented on code in PR #46391:
URL: https://github.com/apache/spark/pull/46391#discussion_r1593236433


##
python/pyspark/pandas/groupby.py:
##
@@ -308,6 +308,7 @@ def aggregate(
 )
 
 if not self._as_index:
+index_cols = list(psdf.columns)

Review Comment:
   Sounds good. Thanks for addressing
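For reference, this is a minimal plain-pandas sketch of the semantics the fix targets: with named aggregation (multi-agg relabeling), `as_index=False` should keep the grouping key as an ordinary column instead of moving it into the index. The pandas API on Spark is expected to mirror this.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})

# Named aggregation with as_index=False: "k" stays a regular column,
# and the relabeled aggregate appears under its new name.
out = df.groupby("k", as_index=False).agg(v_sum=("v", "sum"))
```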






Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWithState operator. [spark]

2024-05-07 Thread via GitHub


HeartSaVioR closed pull request #45376: [SPARK-47960][SS] Allow chaining other 
stateful operators after transformWithState operator.
URL: https://github.com/apache/spark/pull/45376





Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-07 Thread via GitHub


HeartSaVioR commented on PR #45376:
URL: https://github.com/apache/spark/pull/45376#issuecomment-2099511347

   Let's handle that later. Thanks! Merging to master.





Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-07 Thread via GitHub


HeartSaVioR commented on PR #45376:
URL: https://github.com/apache/spark/pull/45376#issuecomment-2099511072

   Let's be sure to either 1) introduce a method for users that gives the watermark value before it advances (for late events), or 2) construct a story for users to set the event time timestamp properly without a watermark value. 
   @sahnib Could you please file a JIRA ticket with blocker priority?





Re: [PR] [WIP] [SPARK-46884] Spark Connect - ExecutePlanRequest new property - job description [spark]

2024-05-07 Thread via GitHub


github-actions[bot] closed pull request #44909: [WIP] [SPARK-46884] Spark 
Connect - ExecutePlanRequest new property - job description
URL: https://github.com/apache/spark/pull/44909





Re: [PR] [SPARK-48170][PYTHON][CONNECT][TESTS] Enable `ArrowPythonUDFParityTests.test_err_return_type` [spark]

2024-05-07 Thread via GitHub


zhengruifeng commented on PR #46433:
URL: https://github.com/apache/spark/pull/46433#issuecomment-2099507206

   thanks @dongjoon-hyun !





Re: [PR] [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [spark]

2024-05-07 Thread via GitHub


sahnib commented on code in PR #45376:
URL: https://github.com/apache/spark/pull/45376#discussion_r1593215731


##
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala:
##
@@ -442,6 +442,16 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   case EventTimeWatermark(columnName, delay, child) =>
 EventTimeWatermarkExec(columnName, delay, planLater(child)) :: Nil
 
+  case UpdateEventTimeWatermarkColumn(columnName, delay, child) =>
+// we expect watermarkDelay to be resolved before physical planning.
+if (delay.isEmpty) {
+  // This is a sanity check. We should not reach here as delay is 
updated during
+  // query plan resolution in 
[[ResolveUpdateEventTimeWatermarkColumn]] Analyzer rule.
+  throw SparkException.internalError(
+"You hit a query analyzer bug. Please report your query to Spark 
user mailing list.")

Review Comment:
   Added more detail to the error. 






Re: [PR] [SPARK-48126][Core] Make `spark.log.structuredLogging.enabled` effective [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46452:
URL: https://github.com/apache/spark/pull/46452#issuecomment-2099482455

   > Since this PR doesn't have any test coverage, could you elaborate more 
specifically in the PR description when this doesn't work?
   
   You are right. I just updated the description.





Re: [PR] [SPARK-48126][Core] Make `spark.log.structuredLogging.enabled` effective [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46452:
URL: https://github.com/apache/spark/pull/46452#issuecomment-2099472878

   cc @panbingkun 





[PR] [SPARK-48126][Core] Make `spark.log.structuredLogging.enabled` effective [spark]

2024-05-07 Thread via GitHub


gengliangwang opened a new pull request, #46452:
URL: https://github.com/apache/spark/pull/46452

   
   
   ### What changes were proposed in this pull request?
   
   Currently, the Spark conf `spark.log.structuredLogging.enabled` does not take effect. This PR fixes that.
   Also, this PR enhances the doc for this configuration.
   
   ### Why are the changes needed?
   
   Bug fix. After the fix, the Spark conf `spark.log.structuredLogging.enabled` 
takes effect.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Manual test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: GPT-4
   I used GPT-4 to improve the documents.





Re: [PR] [SPARK-48100][SQL] Fix issues in skipping nested structure fields not selected in schema [spark]

2024-05-07 Thread via GitHub


sandip-db commented on code in PR #46348:
URL: https://github.com/apache/spark/pull/46348#discussion_r1593198748


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala:
##
@@ -397,8 +397,7 @@ class StaxXmlParser(
 row(anyIndex) = values :+ newValue
 }
   } else {
-StaxXmlParserUtils.skipChildren(parser)
-StaxXmlParserUtils.skipNextEndElement(parser, field, options)
+  StaxXmlParserUtils.skipChildren(parser, field, options)

Review Comment:
   super nit:
   ```suggestion
   StaxXmlParserUtils.skipChildren(parser, field, options)
   ```






Re: [PR] [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs [spark]

2024-05-07 Thread via GitHub


dtenedor commented on PR #46451:
URL: https://github.com/apache/spark/pull/46451#issuecomment-2099469184

   cc @ueshin 





[PR] [SPARK-48180][SQL] Improve error when UDTF call with TABLE arg forgets parentheses around multiple PARTITION/ORDER BY exprs [spark]

2024-05-07 Thread via GitHub


dtenedor opened a new pull request, #46451:
URL: https://github.com/apache/spark/pull/46451

   ### What changes were proposed in this pull request?
   
   This PR improves the error message when a table-valued function call has a 
TABLE argument with a PARTITION BY or ORDER BY clause with more than one 
associated expression, but forgets parentheses around them.
   
   For example:
   
   ```
   SELECT * FROM testUDTF(
 TABLE(SELECT 1 AS device_id, 2 AS data_ds)
 WITH SINGLE PARTITION
 ORDER BY device_id, data_ds)
   ```
   
   This query previously returned an obscure, unrelated error:
   
   ```
   [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.UNSUPPORTED_TABLE_ARGUMENT] 
Unsupported subquery expression: Table arguments are used in a function where 
they are not supported:
   'UnresolvedTableValuedFunction [tvf], [table-argument#338 [], 'data_ds], false
   +- Project [1 AS device_id#336, 2 AS data_ds#337]
      +- OneRowRelation
   ```
   
   Now it returns a reasonable error:
   
   ```
   The table function call includes a table argument with an invalid 
partitioning/ordering specification: the ORDER BY clause included multiple 
expressions without parentheses surrounding them; please add parentheses around 
these expressions and then retry the query again. (line 4, pos 2)
   
   == SQL ==
   
   SELECT * FROM testUDTF(
 TABLE(SELECT 1 AS device_id, 2 AS data_ds)
 WITH SINGLE PARTITION
   --^^^
 ORDER BY device_id, data_ds)
   ```
   
   ### Why are the changes needed?
   
   Here we improve error messages for common SQL syntax mistakes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





[PR] [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework [spark]

2024-05-07 Thread via GitHub


panbingkun opened a new pull request, #46450:
URL: https://github.com/apache/spark/pull/46450

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework [spark]

2024-05-07 Thread via GitHub


zeotuan commented on PR #46362:
URL: https://github.com/apache/spark/pull/46362#issuecomment-2099461633

   @gengliangwang Thank you, LGTM!





Re: [PR] [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo [spark]

2024-05-07 Thread via GitHub


panbingkun commented on PR #46402:
URL: https://github.com/apache/spark/pull/46402#issuecomment-2099458571

   > Could you revise the PR title, @panbingkun ?
   
   Done.





Re: [PR] [SPARK-48152][BUILD] Make `spark-profiler` as a part of release and publish to maven central repo [spark]

2024-05-07 Thread via GitHub


panbingkun commented on PR #46402:
URL: https://github.com/apache/spark/pull/46402#issuecomment-2099458474

   > I merged the following. Could you rebase this PR?
   > 
   > * [[SPARK-48165][BUILD] Update `ap-loader` to 3.0-9 
#46427](https://github.com/apache/spark/pull/46427)
   
   Done.





Re: [PR] [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #46445:
URL: https://github.com/apache/spark/pull/46445#issuecomment-2099443047

   Thanks!





Re: [PR] [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46449: [SPARK-48178][INFRA][3.5] Run 
`build/scala-213/java-11-17` jobs of `branch-3.5` only if needed
URL: https://github.com/apache/spark/pull/46449





Re: [PR] [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46449:
URL: https://github.com/apache/spark/pull/46449#issuecomment-2099440541

   Thank you so much, @viirya !
   
   Merged to branch-3.5.





Re: [PR] [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed [spark]

2024-05-07 Thread via GitHub


viirya commented on PR #46449:
URL: https://github.com/apache/spark/pull/46449#issuecomment-2099440306

   Looks good to me.
   
   On Tue, May 7, 2024 at 3:53 PM Dongjoon Hyun ***@***.***>
   wrote:
   
   > Could you review this PR, @viirya  ?
   >
   





Re: [PR] [SPARK-48178][INFRA][3.5] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46449:
URL: https://github.com/apache/spark/pull/46449#issuecomment-2099439228

   Could you review this PR, @viirya ?





Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

2024-05-07 Thread via GitHub


HyukjinKwon commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2099437696

   Okay I'm fine with this





Re: [PR] [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46442:
URL: https://github.com/apache/spark/pull/46442#issuecomment-2099414711

   Merged to branch-3.5.





Re: [PR] [SPARK-48173][SQL][3.5] CheckAnalysis should see the entire query plan [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46442: [SPARK-48173][SQL][3.5] CheckAnalysis 
should see the entire query plan
URL: https://github.com/apache/spark/pull/46442





Re: [PR] [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46448: [SPARK-48179][INFRA][3.5] Pin 
`nbsphinx` to `0.9.3`
URL: https://github.com/apache/spark/pull/46448





Re: [PR] [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46448:
URL: https://github.com/apache/spark/pull/46448#issuecomment-2099412406

   Thank you!
   Merged to branch-3.5.





Re: [PR] [SPARK-48152][BUILD] Publish the module `spark-profiler` to `maven central repository` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46402:
URL: https://github.com/apache/spark/pull/46402#issuecomment-2099411958

   Could you revise the PR title, @panbingkun ?





Re: [PR] [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46448:
URL: https://github.com/apache/spark/pull/46448#issuecomment-2099406520

   Could you review this PR to recover `branch-3.5`, @huaxingao ?





Re: [PR] [SPARK-48179][INFRA][3.5] Pin `nbsphinx` to `0.9.3` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46448:
URL: https://github.com/apache/spark/pull/46448#issuecomment-2099406269

   Linter passed.
   
   ![Screenshot 2024-05-07 at 15 16 
41](https://github.com/apache/spark/assets/9700541/fb1e0240-66d6-4d4e-a09e-a9e62bf26911)
   





Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

2024-05-07 Thread via GitHub


GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1593161711


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##
@@ -70,20 +78,46 @@ case class Mode(
 buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+if (buff.isEmpty) {
   return null
 }
+val buffer = if (child.dataType.isInstanceOf[StringType] && 
collationEnabled) {
+  val modeMap = buff.foldLeft(
+new TreeMap[org.apache.spark.unsafe.types.UTF8String, 
Long]()(Ordering.comparatorToOrdering(
+  CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   @uros-db `org.apache.spark.util.collection.OpenHashMap` is actually a Spark data structure, so I could try to modify it in a second candidate PR and then have `Mode` rely on an `OpenHashMap` configured to be collation-aware.
   
   The way I could do it would be to modify `org.apache.spark.util.collection.OpenHashSet.Hasher.hash()`, specifically the override of `hash` within a class we would create, `Hasher[String with Collation...]`, to branch to an alternative hashing method that is collation sensitive.
   
   (I would not be modifying `org.apache.spark.util.collection.OpenHashSet#hashcode`, which distributes the hashes more evenly and is called on values returned by `Hasher.hash`.)
   
   This should work because `OpenHashMap` relies internally on `OpenHashSet`, which uses `org.apache.spark.util.collection.OpenHashSet.Hasher`.
   
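   The collation-aware hashing idea above can be sketched outside Spark with the JDK's `java.text.Collator`: hash the collation key rather than the raw string, so that two strings that compare equal under the collation also hash identically. This is only an illustrative sketch under that assumption, not Spark's actual `OpenHashSet.Hasher`; the class and method names here are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical stand-in for a collation-aware Hasher: instead of hashing the
// raw string, hash its CollationKey, which already encodes the collation's
// notion of equality (e.g. case folding at PRIMARY strength).
class CollationAwareHasher {
    private final Collator collator;

    CollationAwareHasher(Locale locale, int strength) {
        this.collator = Collator.getInstance(locale);
        this.collator.setStrength(strength); // e.g. Collator.PRIMARY ignores case
    }

    // Strings equal under the collation produce equal collation keys,
    // hence equal hash codes.
    int hash(String s) {
        return collator.getCollationKey(s).hashCode();
    }

    boolean equalsUnderCollation(String a, String b) {
        return collator.compare(a, b) == 0;
    }
}
```

   A hash map keyed through such a hasher (together with a matching collation-aware equality check) would then group keys per collation semantics, which is what a collation-aware `OpenHashMap` would need.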







Re: [PR] [SPARK-47976][Python] Add Python UDTF DataFrame API to return result of 'analyze' method as output table [spark]

2024-05-07 Thread via GitHub


dtenedor commented on PR #46234:
URL: https://github.com/apache/spark/pull/46234#issuecomment-2099359649

   cc @allisonwang-db @ueshin friendly ping, this is ready for review





Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.0 [spark]

2024-05-07 Thread via GitHub


Fokko commented on PR #46447:
URL: https://github.com/apache/spark/pull/46447#issuecomment-2099354844

   I have to look into the tests  





Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.0 [spark]

2024-05-07 Thread via GitHub


Fokko commented on PR #46447:
URL: https://github.com/apache/spark/pull/46447#issuecomment-2099353902

   Thanks for pointing out @dongjoon-hyun. I've fixed it right away  





Re: [PR] [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework [spark]

2024-05-07 Thread via GitHub


gengliangwang commented on PR #46362:
URL: https://github.com/apache/spark/pull/46362#issuecomment-2099352057

   @zeotuan FYI I pushed a commit 
https://github.com/apache/spark/pull/46362/commits/b45b340d93ff6c6a972e255ee5325ce1b1c7d226
 to reduce a log key. Please help review it and let me know what you think of 
it.





Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-07 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593130617


##
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java:
##
@@ -117,76 +119,438 @@ public Collation(
 }
 
 /**
- * Constructor with comparators that are inherited from the given collator.
+ * collation id (32-bit integer) layout:
+ * bit 31:0 = predefined collation, 1 = user-defined collation
+ * bit 30-29: 00 = utf8-binary, 01 = ICU, 10 = indeterminate (without spec 
implementation)
+ * bit 28:0 for utf8-binary / 0 = case-sensitive, 1 = case-insensitive 
for ICU
+ * bit 27:0 for utf8-binary / 0 = accent-sensitive, 1 = 
accent-insensitive for ICU
+ * bit 26-25: zeroes, reserved for punctuation sensitivity
+ * bit 24-23: zeroes, reserved for first letter preference
+ * bit 22-21: 00 = unspecified, 01 = to-lower, 10 = to-upper
+ * bit 20-19: zeroes, reserved for space trimming
+ * bit 18-17: zeroes, reserved for version
+ * bit 16-12: zeroes
+ * bit 11-0:  zeroes for utf8-binary / locale id for ICU
  */
-public Collation(
-String collationName,
-Collator collator,
-String version,
-boolean supportsBinaryEquality,
-boolean supportsBinaryOrdering,
-boolean supportsLowercaseEquality) {
-  this(
-collationName,
-collator,
-(s1, s2) -> collator.compare(s1.toString(), s2.toString()),
-version,
-s -> (long)collator.getCollationKey(s.toString()).hashCode(),
-supportsBinaryEquality,
-supportsBinaryOrdering,
-supportsLowercaseEquality);
+private abstract static class CollationSpec {
+  protected enum ImplementationProvider {
+UTF8_BINARY, ICU, INDETERMINATE
+  }
+
+  protected enum CaseSensitivity {
+CS, CI
+  }
+
+  protected enum AccentSensitivity {
+AS, AI
+  }
+
+  protected enum CaseConversion {
+UNSPECIFIED, LCASE, UCASE
+  }
+
+  protected static final int IMPLEMENTATION_PROVIDER_OFFSET = 29;
+  protected static final int IMPLEMENTATION_PROVIDER_MASK = 0b11;
+  protected static final int CASE_SENSITIVITY_OFFSET = 28;
+  protected static final int CASE_SENSITIVITY_MASK = 0b1;
+  protected static final int ACCENT_SENSITIVITY_OFFSET = 27;
+  protected static final int ACCENT_SENSITIVITY_MASK = 0b1;
+  protected static final int CASE_CONVERSION_OFFSET = 21;
+  protected static final int CASE_CONVERSION_MASK = 0b11;
+  protected static final int LOCALE_OFFSET = 0;
+  protected static final int LOCALE_MASK = 0x0FFF;
+
+  protected static final int INDETERMINATE_COLLATION_ID =
+ImplementationProvider.INDETERMINATE.ordinal() << 
IMPLEMENTATION_PROVIDER_OFFSET;
+
+  protected final CaseSensitivity caseSensitivity;
+  protected final AccentSensitivity accentSensitivity;
+  protected final CaseConversion caseConversion;
+  protected final String locale;
+  protected final int collationId;
+
+  protected CollationSpec(
+  String locale,
+  CaseSensitivity caseSensitivity,
+  AccentSensitivity accentSensitivity,
+  CaseConversion caseConversion) {
+this.locale = locale;
+this.caseSensitivity = caseSensitivity;
+this.accentSensitivity = accentSensitivity;
+this.caseConversion = caseConversion;
+this.collationId = getCollationId();
+  }
+
+  private static final Map<Integer, Collation> collationMap = new ConcurrentHashMap<>();

Review Comment:
   `ConcurrentHashMap` is used to enforce thread safety of the static `fetchCollation` function. Added a shortcut for the `UTF8_BINARY` collation to skip the map cache.
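
   As a rough sketch of the caching pattern described above (class and method names are placeholders, not the PR's actual `CollationFactory` code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CollationCacheSketch {
    static final int UTF8_BINARY_COLLATION_ID = 0;
    // Placeholder for the pre-built UTF8_BINARY collation object.
    static final String UTF8_BINARY_COLLATION = "UTF8_BINARY";

    // Thread-safe cache of constructed collations, keyed by collation id.
    private static final Map<Integer, String> collationMap = new ConcurrentHashMap<>();

    static String fetchCollation(int collationId) {
        // Shortcut: the common UTF8_BINARY case never touches the map cache.
        if (collationId == UTF8_BINARY_COLLATION_ID) {
            return UTF8_BINARY_COLLATION;
        }
        // computeIfAbsent is atomic per key, which keeps the static
        // fetchCollation safe under concurrent callers.
        return collationMap.computeIfAbsent(collationId, CollationCacheSketch::buildCollation);
    }

    private static String buildCollation(int id) {
        return "collation-" + id; // stand-in for the real constructor logic
    }
}
```

   Repeated `fetchCollation(id)` calls for the same id then return the one cached instance without re-running construction.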



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46841][SQL] Add collation support for ICU locales and collation specifiers [spark]

2024-05-07 Thread via GitHub


nikolamand-db commented on code in PR #46180:
URL: https://github.com/apache/spark/pull/46180#discussion_r1593127364


##
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala:
##
@@ -30,31 +33,101 @@ import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}
 
 class CollationFactorySuite extends AnyFunSuite with Matchers { // 
scalastyle:ignore funsuite
   test("collationId stability") {
-val utf8Binary = fetchCollation(0)
+assert(UTF8_BINARY_COLLATION_ID == 0)
+assert(INDETERMINATE_COLLATION_ID == 1 << 30)
+
+val utf8Binary = fetchCollation(UTF8_BINARY_COLLATION_ID)
 assert(utf8Binary.collationName == "UTF8_BINARY")
 assert(utf8Binary.supportsBinaryEquality)
 
-val utf8BinaryLcase = fetchCollation(1)
+val utf8BinaryLcase = fetchCollation(UTF8_BINARY_LCASE_COLLATION_ID)
 assert(utf8BinaryLcase.collationName == "UTF8_BINARY_LCASE")
 assert(!utf8BinaryLcase.supportsBinaryEquality)
 
-val unicode = fetchCollation(2)
+val unicode = fetchCollation(UNICODE_COLLATION_ID)
 assert(unicode.collationName == "UNICODE")
-assert(unicode.supportsBinaryEquality);
+assert(unicode.supportsBinaryEquality)
 
-val unicodeCi = fetchCollation(3)
+val unicodeCi = fetchCollation(UNICODE_CI_COLLATION_ID)
 assert(unicodeCi.collationName == "UNICODE_CI")
 assert(!unicodeCi.supportsBinaryEquality)
   }
 
-  test("fetch invalid collation name") {
-val error = intercept[SparkException] {
-  fetchCollation("UTF8_BS")
+  test("UTF8_BINARY and ICU root locale collation names") {
+// collation name already normalized
+Seq(
+  "UTF8_BINARY",
+  "UTF8_BINARY_LCASE",
+  "UTF8_BINARY_UCASE",
+  "UNICODE",
+  "UNICODE_CI",
+  "UNICODE_AI",
+  "UNICODE_CI_AI",
+  "UNICODE_LCASE",
+  "UNICODE_UCASE"
+).foreach(collationName => {
+  val col = fetchCollation(collationName)
+  assert(col.collationName == collationName)
+})
+// collation name normalization
+Seq(
+  // UTF8_BINARY
+  ("UTF8_BINARY_CS", "UTF8_BINARY"),

Review Comment:
   Reworked the approach to reflect this. `_AI` and `_CI` will be disabled, while `_CS` and `_AS` will be accepted as the default case- and accent-sensitive behavior, so they have no effect.
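
   A toy illustration of that rule (the `normalize` method below is hypothetical, not the PR's actual code): explicit `_CS` / `_AS` specifiers match the defaults and normalize away, while `_CI` / `_AI` are rejected for `UTF8_BINARY`:

```java
public class CollationNameSketch {
    static String normalize(String name) {
        if (name.startsWith("UTF8_BINARY")) {
            String suffix = name.substring("UTF8_BINARY".length());
            // _CI / _AI are disabled for UTF8_BINARY.
            if (suffix.contains("_CI") || suffix.contains("_AI")) {
                throw new IllegalArgumentException("unsupported collation: " + name);
            }
            // _CS / _AS are the defaults, so they have no effect and drop out.
            return "UTF8_BINARY" + suffix.replace("_CS", "").replace("_AS", "");
        }
        return name;
    }
}
```

   With this sketch, `"UTF8_BINARY_CS"` normalizes to `"UTF8_BINARY"`, while `"UTF8_BINARY_LCASE"` passes through unchanged.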






Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.0 [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46447:
URL: https://github.com/apache/spark/pull/46447#issuecomment-2099344517

   Oh, it seems that the wrong `target` folder files were added.
   
   FYI, this PR is supposed to contain only `pom.xml` and `dev/deps/spark-deps-hadoop-3-hive-2.3`.





[PR] [SPARK-48178][INFRA] Run `build/scala-213/java-11-17` jobs of `branch-3.5` only if needed [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun opened a new pull request, #46449:
URL: https://github.com/apache/spark/pull/46449

   ### What changes were proposed in this pull request?
   
   This PR aims to run the `build`, `scala-213`, and `java-11-17` jobs of `branch-3.5` only if needed, to reduce the maximum concurrency of Apache Spark's GitHub Actions usage.
   
   ### Why are the changes needed?
   
   To meet the ASF Infra GitHub Actions policy, we need to reduce the maximum concurrency.
   - https://infra.apache.org/github-actions-policy.html
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Pass the CIs.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-48177][BUILD] Upgrade `Apache Parquet` to 1.14.0 [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46447:
URL: https://github.com/apache/spark/pull/46447#issuecomment-2099295786

   cc @cloud-fan , @HyukjinKwon , @mridulm , @sunchao , @yaooqinn , 
@LuciferYang , @steveloughran , @viirya , @huaxin, @parthchandra , too.





[PR] [SPARK-XXX][INFRA][3.5] Pin `nbsphinx` to `0.9.3` [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun opened a new pull request, #46448:
URL: https://github.com/apache/spark/pull/46448

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun closed pull request #46445: 
[SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix 
Python Linter error
URL: https://github.com/apache/spark/pull/46445





Re: [PR] [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46445:
URL: https://github.com/apache/spark/pull/46445#issuecomment-2099267678

   `Python Linter` passed, and the doc generation issue is a different one.
   
   Merged to branch-3.5.





Re: [PR] [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46445:
URL: https://github.com/apache/spark/pull/46445#issuecomment-2099266436

   Thank you, @gengliangwang !





[PR] [SPARK-48177][BUILD]: Bump Apache Parquet to 1.14.0 [spark]

2024-05-07 Thread via GitHub


Fokko opened a new pull request, #46447:
URL: https://github.com/apache/spark/pull/46447

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   Fixes quite a few bugs on the Parquet side: 
https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140
   
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   
   ### How was this patch tested?
   
   Using the existing unit tests
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   
   No





Re: [PR] [SPARK-48167][PYTHON][TESTS][FOLLOWUP][3.5] Reformat test_readwriter.py to fix Python Linter error [spark]

2024-05-07 Thread via GitHub


dongjoon-hyun commented on PR #46445:
URL: https://github.com/apache/spark/pull/46445#issuecomment-2099246482

   Could you review this PR, @gengliangwang ?





Re: [PR] [SPARK-48031] view evolution [spark]

2024-05-07 Thread via GitHub


srielau commented on code in PR #46267:
URL: https://github.com/apache/spark/pull/46267#discussion_r1593035689


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala:
##
@@ -1304,7 +1315,8 @@ case class CreateView(
 originalText: Option[String],
 query: LogicalPlan,
 allowExisting: Boolean,
-replace: Boolean) extends BinaryCommand with CTEInChildren {
+replace: Boolean,
+viewSchemaMode: ViewSchemaMode) extends BinaryCommand with CTEInChildren {

Review Comment:
   We need this






  1   2   3   >