Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-16 Thread via GitHub


cloud-fan commented on PR #46034:
URL: https://github.com/apache/spark/pull/46034#issuecomment-2060437664

   The test fails: `org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite`





Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-16 Thread via GitHub


cloud-fan commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1568259745


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala:
##
@@ -21,36 +21,68 @@ import scala.collection.mutable
 
 import org.apache.spark.SparkException
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, PlanHelper, Project}
+import org.apache.spark.sql.catalyst.planning.PhysicalAggregation
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, PlanHelper, Project}
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, WITH_EXPRESSION}
 
 /**
  * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the common expressions, or
  * just inline them if they are cheap.
  *
+ * Since this rule can introduce new `Project` operators, it is advised to run [[CollapseProject]]
+ * after this rule.
+ *
  * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. If we expand its
  *   usage, we should support aggregate/window functions as well.
  */
 object RewriteWithExpression extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = {
-    plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+    plan.transformUpWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) {
+      // For aggregates, separate the computation of the aggregations themselves from the final
+      // result by moving the final result computation into a projection above it. This prevents
+      // this rule from producing an invalid Aggregate operator.
+      case p @ PhysicalAggregation(
+          groupingExpressions, aggregateExpressions, resultExpressions, child)
+          if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) =>
+        // There should not be dangling common expression references in the aggregate expressions.
+        // This can happen if a With is created with an aggregate function in its child.
+        assert(!aggregateExpressions.exists(ae =>

Review Comment:
   Shall we do the assert in the constructor of `With`, to fail earlier?
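   For illustration, a minimal sketch of the suggested check, placed in the body of the existing `With` case class (names follow the Catalyst code quoted above; a sketch, not the final patch):

   ```scala
   // Hypothetical constructor-time check: fail as soon as a With is built with
   // an aggregate function inside its common expression definitions.
   assert(!defs.exists(_.exists(_.isInstanceOf[AggregateExpression])),
     "With expression definitions must not contain aggregate functions")
   ```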






[PR] [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.sql.connect.resource into PyPI packaging [spark]

2024-04-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46094:
URL: https://github.com/apache/spark/pull/46094

   ### What changes were proposed in this pull request?
   
   This PR proposes to add `pyspark.sql.connect.resource` into PyPI packaging.
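   For context, PyPI packaging of PySpark is driven by the `packages` list in `python/setup.py`, so the change presumably amounts to one extra entry there (a sketch, not the actual diff):

   ```python
   # python/setup.py (sketch): every module must be listed explicitly,
   # otherwise it is omitted from the PyPI wheel/sdist.
   packages = [
       "pyspark",
       "pyspark.sql",
       "pyspark.sql.connect",
       "pyspark.sql.connect.resource",  # entry added by this PR (assumed)
       # ...
   ]
   ```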
   
   ### Why are the changes needed?
   
   In order for PyPI end users to download PySpark and leverage this feature.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, the main change has not been released.
   
   ### How was this patch tested?
   
   Being tested at https://github.com/apache/spark/pull/46090
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.





Re: [PR] [SPARK-47591][SQL] Hive-thriftserver: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #45926:
URL: https://github.com/apache/spark/pull/45926#issuecomment-2060387686

   @itholic Please resolve the conflict so that I can merge this one. Thanks.





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang closed pull request #46086: [SPARK-47588][CORE] Hive module: 
Migrate logInfo with variables to structured logging framework
URL: https://github.com/apache/spark/pull/46086





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #46086:
URL: https://github.com/apache/spark/pull/46086#issuecomment-2060385871

   @dongjoon-hyun @HyukjinKwon Thanks for the review. Merging to master.





Re: [PR] [WIP][SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


panbingkun commented on code in PR #46057:
URL: https://github.com/apache/spark/pull/46057#discussion_r1568232939


##
common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala:
##
@@ -172,12 +228,16 @@ object LogKey extends Enumeration {
   val TOPIC_PARTITION = Value
   val TOTAL_EFFECTIVE_TIME = Value
   val TOTAL_TIME = Value
-  val UNSUPPORTED_EXPRESSION = Value

Review Comment:
   Let's unify it as an abbreviation
   `EXPRESSION` -> `EXPR`
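   That is, something like this against the enumeration shown above (sketch):

   ```scala
   object LogKey extends Enumeration {
     // Abbreviate EXPRESSION to EXPR for consistency with other keys.
     val UNSUPPORTED_EXPR = Value  // instead of UNSUPPORTED_EXPRESSION
   }
   ```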






Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]

2024-04-16 Thread via GitHub


bluzy commented on PR #46025:
URL: https://github.com/apache/spark/pull/46025#issuecomment-2060354963

   @dongjoon-hyun @mridulm 
   I think an incorrect inprogress file would be deleted on the cleaner's schedule, wouldn't it?
   My concern is that many Spark streaming applications can live forever until they need to be upgraded. It's difficult to set a `spark.history.fs.cleaner.maxAge` value while a streaming job is running.
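   For reference, the cleaner behavior under discussion is controlled by the History Server settings along these lines (values are illustrative):

   ```
   spark.history.fs.cleaner.enabled   true
   spark.history.fs.cleaner.interval  1d
   spark.history.fs.cleaner.maxAge    7d
   ```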





[PR] [SPARK-47882][SQL] createTableColumnTypes need to be mapped to database types instead of using directly [spark]

2024-04-16 Thread via GitHub


yaooqinn opened a new pull request, #46093:
URL: https://github.com/apache/spark/pull/46093

   …
   
   
   
   ### What changes were proposed in this pull request?
   
   createTableColumnTypes contains Spark SQL data type definitions. The underlying database might not recognize them; for example, BOOLEAN for Oracle (v < 23c).
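   For reference, the option in question is used on the write path like this (a sketch; the table and column names are made up):

   ```scala
   // The Spark SQL types given here must be translated through the JDBC
   // dialect before being used in the generated CREATE TABLE statement.
   df.write
     .option("createTableColumnTypes", "flag BOOLEAN, name VARCHAR(32)")
     .jdbc(jdbcUrl, "test_table", connectionProperties)
   ```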
   
   ### Why are the changes needed?
   
   
   bugfix
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   new test
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   no





Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn closed pull request #46092: [SPARK-47880][SQL][DOCS] Oracle: Document 
Mapping Spark SQL Data Types to Oracle
URL: https://github.com/apache/spark/pull/46092





Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on PR #46091:
URL: https://github.com/apache/spark/pull/46091#issuecomment-2060344927

   Merged to master





Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on PR #46092:
URL: https://github.com/apache/spark/pull/46092#issuecomment-2060346342

   Merged to master. Thank you @dongjoon-hyun 





Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]

2024-04-16 Thread via GitHub


yaooqinn closed pull request #46091: [SPARK-47879][SQL] Oracle: Use VARCHAR2 
instead of VARCHAR for VarcharType mapping
URL: https://github.com/apache/spark/pull/46091





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568210917


##
python/pyspark/errors/utils.py:
##
@@ -16,9 +16,14 @@
 #
 
 import re
-from typing import Dict, Match
-
+import functools
+import inspect
+from typing import Any, Callable, Dict, Match, TypeVar, Type
 from pyspark.errors.error_classes import ERROR_CLASSES_MAP
+from py4j.java_gateway import JavaClass

Review Comment:
   Sure, just updated






Re: [PR] [SPARK-47850][SQL] Support `spark.sql.hive.convertInsertingUnpartitionedTable` [spark]

2024-04-16 Thread via GitHub


pan3793 commented on PR #46052:
URL: https://github.com/apache/spark/pull/46052#issuecomment-2060319567

   cc @ulysses-you who refactored this part





Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on code in PR #46092:
URL: https://github.com/apache/spark/pull/46092#discussion_r1568183291


##
docs/sql-data-sources-jdbc.md:
##
@@ -1335,3 +1335,109 @@ as the activated JDBC Driver.
+
+### Mapping Spark SQL Data Types to Oracle
+
+The table below describes the data type conversions from Spark SQL Data Types to Oracle data types,
+when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with
+the Oracle JDBC as the activated JDBC Driver.
+
+  Spark SQL Data Type | Oracle Data Type | Remarks
+  --------------------|------------------|--------
+  BooleanType         | NUMBER(1, 0)     | BooleanType maps to NUMBER(1, 0), as BOOLEAN was only introduced in Oracle Release 23c

Review Comment:
   Yes, previous versions didn't have BOOLEAN






Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46025:
URL: https://github.com/apache/spark/pull/46025#issuecomment-2060295412

   Yes, Mridul's comment is correct. I believe the AS-IS behavior is robust, safe, and intended, rather than a bug.
   
   WDYT, @bluzy ?





Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on code in PR #46092:
URL: https://github.com/apache/spark/pull/46092#discussion_r1568177905


##
docs/sql-data-sources-jdbc.md:
##
@@ -1335,3 +1335,109 @@ as the activated JDBC Driver.
+
+### Mapping Spark SQL Data Types to Oracle
+
+The table below describes the data type conversions from Spark SQL Data Types to Oracle data types,
+when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with
+the Oracle JDBC as the activated JDBC Driver.
+
+  Spark SQL Data Type | Oracle Data Type | Remarks
+  --------------------|------------------|--------
+  BooleanType         | NUMBER(1, 0)     | BooleanType maps to NUMBER(1, 0), as BOOLEAN was only introduced in Oracle Release 23c

Review Comment:
   Is it because `Oracle Database 23c` was introduced in April 2023?






Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on code in PR #46092:
URL: https://github.com/apache/spark/pull/46092#discussion_r1568168654


##
docs/sql-data-sources-jdbc.md:
##
@@ -1335,3 +1335,109 @@ as the activated JDBC Driver.
+
+### Mapping Spark SQL Data Types to Oracle
+
+The table below describes the data type conversions from Spark SQL Data Types to Oracle data types,
+when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with
+the Oracle JDBC as the activated JDBC Driver.
+
+  Spark SQL Data Type | Oracle Data Type               | Remarks
+  --------------------|--------------------------------|--------
+  BooleanType         | NUMBER(1, 0)                   | BooleanType maps to NUMBER(1, 0), as BOOLEAN was only introduced in Oracle Release 23c
+  ByteType            | NUMBER(3)                      |
+  ShortType           | NUMBER(5)                      |
+  IntegerType         | NUMBER(10)                     |
+  LongType            | NUMBER(19)                     |
+  FloatType           | NUMBER(19, 4)                  |
+  DoubleType          | NUMBER(19, 4)                  |
+  DecimalType(p, s)   | NUMBER(p,s)                    |
+  DateType            | DATE                           |
+  TimestampType       | TIMESTAMP WITH LOCAL TIME ZONE |
+  TimestampNTZType    | TIMESTAMP                      |
+  StringType          | VARCHAR2(255)                  | For historical reasons, a string value is limited to a maximum of 255 characters
+  BinaryType          | BLOB                           |
+  CharType(n)         | CHAR(n)                        |
+  VarcharType(n)      | VARCHAR2(n)                    |

Review Comment:
   It's VARCHAR2 after #46091.
   It does not make any difference to the users, so I didn't add any remarks; less is more.






Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on code in PR #46092:
URL: https://github.com/apache/spark/pull/46092#discussion_r1568169069


##
docs/sql-data-sources-jdbc.md:
##
@@ -1335,3 +1335,109 @@ as the activated JDBC Driver.
+
+### Mapping Spark SQL Data Types to Oracle
+
+The table below describes the data type conversions from Spark SQL Data Types to Oracle data types,
+when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with
+the Oracle JDBC as the activated JDBC Driver.
+
+  Spark SQL Data Type | Oracle Data Type | Remarks
+  --------------------|------------------|--------
+  BooleanType         | NUMBER(1, 0)     | BooleanType maps to NUMBER(1, 0), as BOOLEAN was only introduced in Oracle Release 23c

Review Comment:
   We might introduce a configuration that lets users map to BOOLEAN instead.






Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on PR #46091:
URL: https://github.com/apache/spark/pull/46091#issuecomment-2060277921

   Thank you @dongjoon-hyun 





[PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]

2024-04-16 Thread via GitHub


yaooqinn opened a new pull request, #46092:
URL: https://github.com/apache/spark/pull/46092

   
   
   
   
   ### What changes were proposed in this pull request?
   
   
   Documents Mapping Spark SQL Data Types to Oracle
   
   ### Why are the changes needed?
   
   documentation improvement
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   
   doc build
   
![image](https://github.com/apache/spark/assets/8326978/3bc7816a-f887-4f8c-95c0-15ec2c4ac7fb)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568166135


##
python/pyspark/errors/utils.py:
##
@@ -16,9 +16,14 @@
 #
 
 import re
-from typing import Dict, Match
-
+import functools
+import inspect
+from typing import Any, Callable, Dict, Match, TypeVar, Type
 from pyspark.errors.error_classes import ERROR_CLASSES_MAP
+from py4j.java_gateway import JavaClass

Review Comment:
   Can we hide this under `typing.TYPE_CHECKING`? For a pure Python library package, we should avoid this whenever possible. Otherwise, the scheduled job would fail.
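   A minimal sketch of the suggested guard, so `py4j` is only imported by static type checkers and never at runtime:

   ```python
   from typing import TYPE_CHECKING

   if TYPE_CHECKING:
       # Only evaluated by type checkers; avoids a hard py4j dependency
       # for the pure Python library package.
       from py4j.java_gateway import JavaClass

   def _capture_call_site(pyspark_origin: "JavaClass", fragment: str) -> None:
       ...
   ```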






Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568165854


##
python/pyspark/errors/utils.py:
##
@@ -119,3 +124,68 @@ def get_message_template(self, error_class: str) -> str:
             message_template = main_message_template + " " + sub_message_template
 
         return message_template
+
+
+def _capture_call_site(pyspark_origin: JavaClass, fragment: str) -> None:
+    """
+    Capture the call site information including file name, line number, and function name.
+    This function updates the thread-local storage from the JVM side (PySparkCurrentOrigin)
+    with the current call site information when a PySpark API function is called.
+
+    Parameters
+    ----------
+    pyspark_origin : JavaClass
Review Comment:
   ```suggestion
   pyspark_origin : py4j.JavaClass
   ```






Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568162931


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self):
             pyspark_fragment="eqNullSafe",
         )
 
-        # DataFrameQueryContext with pysparkLoggingInfo - and
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("and_invalid_type", df.id & "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id AND string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="and",
-        )
-
-        # DataFrameQueryContext with pysparkLoggingInfo - or
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("or_invalid_type", df.id | "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id OR string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="or",
-        )

Review Comment:
   In short, the previous PR changed the error message, but since the purpose of this work is only to provide additional error context on top of the existing error message, that change was actually unintended.






Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46065:
URL: https://github.com/apache/spark/pull/46065#issuecomment-2060267298

   Merged to master for Apache Spark 4.0.0.
   
   Thank YOU for the contribution, @neilramaswamy .





Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun closed pull request #46065: [SPARK-47838][BUILD] Upgrade 
`rocksdbjni` to 8.11.4
URL: https://github.com/apache/spark/pull/46065





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568157339


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self):
             pyspark_fragment="eqNullSafe",
         )
 
-        # DataFrameQueryContext with pysparkLoggingInfo - and
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("and_invalid_type", df.id & "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id AND string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="and",
-        )
-
-        # DataFrameQueryContext with pysparkLoggingInfo - or
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("or_invalid_type", df.id | "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id OR string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="or",
-        )

Review Comment:
   In the previous PR, the error message was changed because we directly called 
the JVM's "fn" function.
   
   **Before the previous PR**
   ```
   >>> df.withColumn("or_invalid_type", df.id | "string").collect()
   Traceback (most recent call last):
   py4j.protocol.Py4JError: An error occurred while calling o40.or. Trace:
   py4j.Py4JException: Method or([class java.lang.String]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:329)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
   ```
   
   **After the previous PR**
   ```
   >>> df.withColumn("or_invalid_type", df.id | "string").collect()
   Traceback (most recent call last):
   pyspark.errors.exceptions.captured.AnalysisException: 
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(id OR string)" due to 
data type mismatch: the left and right operands of the binary operator have 
incompatible types ("BIGINT" and "STRING"). SQLSTATE: 42K09;
   'Project [id#0L, (id#0L OR string) AS or_invalid_type#2]
   +- Range (0, 10, step=1, splits=Some(12))
   ```
   
   We changed the approach in the current PR so that "fn" is no longer called directly, and the error message has returned to its original state.
   
   Additionally, the JVM doesn't have a `DataFrameQueryContext` for the "and", "or" and "not" functions either, which aligns with the current behavior.
   
   I guess such an exception for `and`, `or`, `not` is caused by:
   
   
https://github.com/apache/spark/blob/6c827c10dc15e03178277a415c0e26e2d9d3a2f9/python/pyspark/sql/column.py#L402-L408
   






Re: [PR] [SPARK-47765][SQL] Add SET COLLATION to parser rules [spark]

2024-04-16 Thread via GitHub


cloud-fan commented on PR #45946:
URL: https://github.com/apache/spark/pull/45946#issuecomment-2060252622

   Shall we fail this command if the string collation feature flag is turned 
off?





[PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]

2024-04-16 Thread via GitHub


yaooqinn opened a new pull request, #46091:
URL: https://github.com/apache/spark/pull/46091

   
   
   
   
   ### What changes were proposed in this pull request?
   
   
   Use VARCHAR2 instead of VARCHAR for the VarcharType mapping on the write side. VARCHAR is a synonym of VARCHAR2, but its behavior is not guaranteed to stay stable in future Oracle releases.
   
   ### Why are the changes needed?
   
   Oracle official documentation says:
   
   > Do not use the VARCHAR data type. Use the VARCHAR2 data type instead. 
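   A sketch of what such a write-side mapping looks like in a JDBC dialect (`getJDBCType` is the standard `JdbcDialect` hook; treat the body as illustrative rather than the exact patch):

   ```scala
   // Illustrative: map Spark's VarcharType(n) to Oracle's VARCHAR2(n) on write.
   override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
     case VarcharType(n) => Some(JdbcType(s"VARCHAR2($n)", java.sql.Types.VARCHAR))
     case _ => None
   }
   ```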
   
   ### Does this PR introduce _any_ user-facing change?
   
   no
   
   ### How was this patch tested?
   
   new tests
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE [spark]

2024-04-16 Thread via GitHub


yaooqinn commented on PR #46080:
URL: https://github.com/apache/spark/pull/46080#issuecomment-2060233886

   Thank you very much as always @dongjoon-hyun 





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


ueshin commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568131652


##
python/pyspark/sql/tests/test_dataframe.py:
##
@@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self):
             pyspark_fragment="eqNullSafe",
         )
 
-        # DataFrameQueryContext with pysparkLoggingInfo - and
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("and_invalid_type", df.id & "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id AND string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="and",
-        )
-
-        # DataFrameQueryContext with pysparkLoggingInfo - or
-        with self.assertRaises(AnalysisException) as pe:
-            df.withColumn("or_invalid_type", df.id | "string").collect()
-        self.check_error(
-            exception=pe.exception,
-            error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE",
-            message_parameters={
-                "inputType": '"BOOLEAN"',
-                "actualDataType": '"BIGINT"',
-                "sqlExpr": '"(id OR string)"',
-            },
-            query_context_type=QueryContextType.DataFrame,
-            pyspark_fragment="or",
-        )

Review Comment:
   I'm wondering why we remove these tests?






Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]

2024-04-16 Thread via GitHub


neilramaswamy commented on PR #46065:
URL: https://github.com/apache/spark/pull/46065#issuecomment-2060226204

   @dongjoon-hyun, should be ready to merge now. Appreciate your feedback!





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568119580


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   Yeah, `PySparkCurrentOrigin` should be cleared on the Python side to ensure 
consistent operation.
   
   It is working properly now. Thanks, @ueshin 
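   The pattern being described is roughly the following (hypothetical helper names; only the try/finally shape matters):

   ```python
   import functools

   def _with_origin(func):
       """Capture the call site before the JVM call and always clear it afterwards."""
       @functools.wraps(func)
       def wrapper(*args, **kwargs):
           pyspark_origin = _jvm_pyspark_current_origin()  # hypothetical accessor
           _capture_call_site(pyspark_origin, func.__name__)
           try:
               return func(*args, **kwargs)
           finally:
               pyspark_origin.clear()  # clear from the Python side for consistency
       return wrapper
   ```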






Re: [PR] [SPARK-47870][SQL] Optimize predicate after push extra predicate through join [spark]

2024-04-16 Thread via GitHub


zml1206 commented on PR #46085:
URL: https://github.com/apache/spark/pull/46085#issuecomment-2060202987

   cc @cloud-fan 





Re: [PR] [SPARK-47870][SQL] Optimize predicate after push extra predicate through join [spark]

2024-04-16 Thread via GitHub


zml1206 commented on code in PR #46085:
URL: https://github.com/apache/spark/pull/46085#discussion_r1568115344


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala:
##
@@ -46,15 +46,30 @@ class FilterPushdownSuite extends PlanTest {
         PushDownPredicates) :: Nil
   }
 
+  object PushExtraPredicateThroughJoinOptimize extends RuleExecutor[LogicalPlan] {

Review Comment:
   In order not to affect the existing UTs, I added a new Optimize rule executor.
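   That is, something along these lines next to the existing `Optimize` object (batch contents assumed):

   ```scala
   object PushExtraPredicateThroughJoinOptimize extends RuleExecutor[LogicalPlan] {
     // Same shape as the existing Optimize object, but scoped to the rule under
     // test so the pre-existing test expectations are untouched.
     val batches =
       Batch("Push extra predicate through join", FixedPoint(10),
         PushExtraPredicateThroughJoin) :: Nil
   }
   ```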






Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]

2024-04-16 Thread via GitHub


mridulm commented on PR #46025:
URL: https://github.com/apache/spark/pull/46025#issuecomment-2060196960

   Note that when the driver crashes, the event file remains with the `.inprogress` suffix.
   Not deleting these files would result in filling up the event directory - and eventually failing all jobs (depending on the fs limits for the number of files per dir).





Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


itholic commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1568104641


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   Thanks @ueshin, will try 






Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


cloud-fan commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568103738


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR}
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id`
+ * in join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(

Review Comment:
   We can easily disable a QO rule by setting 
`spark.sql.optimizer.excludedRules`
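   For example:

   ```scala
   // Disable the new rule at runtime without a dedicated feature flag:
   spark.conf.set("spark.sql.optimizer.excludedRules",
     "org.apache.spark.sql.catalyst.optimizer.OptimizeJoinCondition")
   ```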






Re: [PR] [SPARK-47767][SQL] Show offset value in TakeOrderedAndProjectExec [spark]

2024-04-16 Thread via GitHub


guixiaowen commented on PR #45931:
URL: https://github.com/apache/spark/pull/45931#issuecomment-2060168379

   > Could you add one test case like `EXPLAIN ... LIMIT ... OFFSET ... ORDER 
BY ...` at 
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/explain.sql
 ?
   
   @beliefer Thank you for your review. I just changed the prompt message. This will not affect the UT test.
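   For reference, the kind of test case being requested would look roughly like this (assuming one of the `explain_temp` tables that explain.sql already defines):

   ```sql
   -- hypothetical addition to explain.sql
   EXPLAIN EXTENDED
     SELECT key FROM explain_temp1 ORDER BY key LIMIT 5 OFFSET 2;
   ```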





Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


zml1206 commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568084128


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR}
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id`
+ * in join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(

Review Comment:
   Sorry, I don’t understand what kind of flag it is. Can you give me an 
example? Thank you.






Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]

2024-04-16 Thread via GitHub


xinrong-meng closed pull request #46088: [SPARK-47876][PYTHON][DOCS] Improve 
docstring of mapInArrow
URL: https://github.com/apache/spark/pull/46088





Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]

2024-04-16 Thread via GitHub


xinrong-meng commented on PR #46088:
URL: https://github.com/apache/spark/pull/46088#issuecomment-2060150462

   Thank you all, merged to master!





Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


zml1206 commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568083598


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:
##
@@ -621,4 +621,14 @@ class DataFrameJoinSuite extends QueryTest
 checkAnswer(joined, Row("x", null, null))
 checkAnswer(joined.filter($"new".isNull), Row("x", null, null))
   }
+
+  test("SPARK-47810: replace equivalent expression to <=> in join condition") {

Review Comment:
   Added.






Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46089:
URL: https://github.com/apache/spark/pull/46089#discussion_r1568048732


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -0,0 +1,139 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+======================
+Python Data Source API
+======================
+
+.. currentmodule:: pyspark.sql
+
+Overview
+--------
+The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python.
+This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources.
+
+
+Creating a Python Data Source
+-----------------------------
+To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data.
+
+This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment.
+
+**Step 1: Define the Data Source**
+
+Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows:
+
+.. code-block:: python
+
+    from pyspark.sql.datasource import DataSource, DataSourceReader
+    from pyspark.sql.types import StructType
+
+    class FakeDataSource(DataSource):
+        """
+        A fake data source for PySpark to generate synthetic data using the `faker` library.
+        Options:
+        - numRows: specify number of rows to generate. Default value is 3.
+        """
+
+        @classmethod
+        def name(cls):
+            return "fake"
+
+        def schema(self):
+            return "name string, date string, zipcode string, state string"
+
+        def reader(self, schema: StructType):
+            return FakeDataSourceReader(schema, self.options)
+
+
+**Step 2: Implement the Reader**
+
+Define the reader logic to generate synthetic data. Use the `faker` library to populate each field in the schema.
+
+.. code-block:: python
+
+    class FakeDataSourceReader(DataSourceReader):
+
+        def __init__(self, schema, options):
+            self.schema: StructType = schema
+            self.options = options
+
+        def read(self, partition):
+            from faker import Faker
+            fake = Faker()
+            # Note: every value in this `self.options` dictionary is a string.
+            num_rows = int(self.options.get("numRows", 3))
+            for _ in range(num_rows):
+                row = []
+                for field in self.schema.fields:
+                    value = getattr(fake, field.name)()
+                    row.append(value)
+                yield tuple(row)
+
+
+Using a Python Data Source

Review Comment:
   Could we add a small section for the `pip install` case too? I have an example at https://github.com/hyukjinkwon/pyspark-jira. We should probably say that this is not supported on the Spark Connect client side (thus the package has to be installed on the Spark Connect server side).
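   For context, the usage for a pip-installed data source would be roughly as follows (a sketch; the package name is hypothetical):

   ```python
   # pip install pyspark-faker-datasource  (hypothetical package)
   from pyspark_faker_datasource import FakeDataSource

   # Registration happens where the driver runs; with Spark Connect, that is
   # the server side, not the client.
   spark.dataSource.register(FakeDataSource)
   spark.read.format("fake").load().show()
   ```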




Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on PR #46089:
URL: https://github.com/apache/spark/pull/46089#issuecomment-2060146837

   Merged to master.





Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]

2024-04-16 Thread via GitHub


xinrong-meng commented on PR #46088:
URL: https://github.com/apache/spark/pull/46088#issuecomment-2060148294

   Good idea! I'll file a separate PR, @zhengruifeng, thanks!
   
   Thanks @allisonwang-db, I'll create tickets under the umbrella.





Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


HyukjinKwon closed pull request #46089: [SPARK-46375][DOCS] Add user guide for 
Python data source API
URL: https://github.com/apache/spark/pull/46089





[PR] [WIP][SPARK-47763][CONNECT][TESTS] Enable local-cluster tests with pyspark-connect package [spark]

2024-04-16 Thread via GitHub


HyukjinKwon opened a new pull request, #46090:
URL: https://github.com/apache/spark/pull/46090

   ### What changes were proposed in this pull request?
   
   TBD
   
   ### Why are the changes needed?
   
   TBD
   
   ### Does this PR introduce _any_ user-facing change?
   
   TBD
   
   ### How was this patch tested?
   
   TBD
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   TBD
   





Re: [PR] [SPARK-47846][SQL] Add support for Variant type in from_json expression [spark]

2024-04-16 Thread via GitHub


harshmotw-db commented on PR #46046:
URL: https://github.com/apache/spark/pull/46046#issuecomment-2060102585

   @chenhao-db can you please look at this whenever you're free?





Re: [PR] [SPARK-47418][SQL] Add hand-crafted implementations for lowercase uni… [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on PR #46082:
URL: https://github.com/apache/spark/pull/46082#issuecomment-2060101653

   Mind making the PR title complete? It's truncated.





Re: [PR] [SPARK-47416][SQL] Add new functions to CollationBenchmark [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46078:
URL: https://github.com/apache/spark/pull/46078#discussion_r1568055547


##
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala:
##
@@ -100,6 +100,90 @@ abstract class CollationBenchmarkBase extends 
BenchmarkBase {
 )
 benchmark.run()
   }
+
+  def benchmarkUTFStringContains(
+  collationTypes: Seq[String],
+  utf8Strings: Seq[UTF8String]): Unit = {
+val sublistStrings = utf8Strings
+
+val benchmark = new Benchmark(
+  "collation unit benchmarks - contains",
+  utf8Strings.size * 10,
+  warmupTime = 10.seconds,
+  output = output)
+collationTypes.foreach(collationType => {

Review Comment:
   can you use `foreach { collationType =>` instead






Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on PR #46045:
URL: https://github.com/apache/spark/pull/46045#issuecomment-2060090501

   I am fine with this change





Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46089:
URL: https://github.com/apache/spark/pull/46089#discussion_r1568048989


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -0,0 +1,139 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+======================
+Python Data Source API
+======================
+
+.. currentmodule:: pyspark.sql
+
+Overview
+--------
+The Python Data Source API is a new feature introduced in Spark 4.0, enabling 
developers to read from custom data sources and write to custom data sinks in 
Python.
+This guide provides a comprehensive overview of the API and instructions on 
how to create, use, and manage Python data sources.
+
+
+Creating a Python Data Source

Review Comment:
   I would have a guide for static vs runtime cases.






Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on code in PR #46089:
URL: https://github.com/apache/spark/pull/46089#discussion_r1568048732


##
python/docs/source/user_guide/sql/python_data_source.rst:
##
@@ -0,0 +1,139 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+======================
+Python Data Source API
+======================
+
+.. currentmodule:: pyspark.sql
+
+Overview
+--------
+The Python Data Source API is a new feature introduced in Spark 4.0, enabling 
developers to read from custom data sources and write to custom data sinks in 
Python.
+This guide provides a comprehensive overview of the API and instructions on 
how to create, use, and manage Python data sources.
+
+
+Creating a Python Data Source
+-----------------------------
+To create a custom Python data source, you'll need to subclass the 
:class:`DataSource` base classes and implement the necessary methods for 
reading and writing data.
+
+This example demonstrates creating a simple data source to generate synthetic 
data using the `faker` library. Ensure the `faker` library is installed and 
accessible in your Python environment.
+
+**Step 1: Define the Data Source**
+
+Start by creating a new subclass of :class:`DataSource`. Define the source 
name, schema, and reader logic as follows:
+
+.. code-block:: python
+
+from pyspark.sql.datasource import DataSource, DataSourceReader
+from pyspark.sql.types import StructType
+
+class FakeDataSource(DataSource):
+"""
+A fake data source for PySpark to generate synthetic data using the 
`faker` library.
+Options:
+- numRows: specify number of rows to generate. Default value is 3.
+"""
+
+@classmethod
+def name(cls):
+return "fake"
+
+def schema(self):
+return "name string, date string, zipcode string, state string"
+
+def reader(self, schema: StructType):
+return FakeDataSourceReader(schema, self.options)
+
+
+**Step 2: Implement the Reader**
+
+Define the reader logic to generate synthetic data. Use the `faker` library to 
populate each field in the schema.
+
+.. code-block:: python
+
+class FakeDataSourceReader(DataSourceReader):
+
+def __init__(self, schema, options):
+self.schema: StructType = schema
+self.options = options
+
+def read(self, partition):
+from faker import Faker
+fake = Faker()
+# Note: every value in this `self.options` dictionary is a string.
+num_rows = int(self.options.get("numRows", 3))
+for _ in range(num_rows):
+row = []
+for field in self.schema.fields:
+value = getattr(fake, field.name)()
+row.append(value)
+yield tuple(row)
+
+
+Using a Python Data Source

Review Comment:
   Could we add a small section for `pip install` case too? I have an example 
at https://github.com/hyukjinkwon/pyspark-jira. We should probably say that 
this is not supported with Spark Connect client side (thus it has to be 
installed in Spark Connect server side).






Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]

2024-04-16 Thread via GitHub


HyukjinKwon closed pull request #46072: [SPARK-47877][SS][CONNECT] Speed up 
test_parity_listener
URL: https://github.com/apache/spark/pull/46072





Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on PR #46072:
URL: https://github.com/apache/spark/pull/46072#issuecomment-2060083261

   Merged to master.





Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun closed pull request #46087: [SPARK-47875][CORE] Remove 
`spark.deploy.recoverySerializer`
URL: https://github.com/apache/spark/pull/46087





Re: [PR] [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function doctests [spark]

2024-04-16 Thread via GitHub


HyukjinKwon closed pull request #46055: 
[SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function 
doctests
URL: https://github.com/apache/spark/pull/46055





Re: [PR] [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function doctests [spark]

2024-04-16 Thread via GitHub


HyukjinKwon commented on PR #46055:
URL: https://github.com/apache/spark/pull/46055#issuecomment-2060081075

   Merged to master.





Re: [PR] [SPARK-47816][CONNECT][DOCS] Document the lazy evaluation of views in `spark.{sql, table}` [spark]

2024-04-16 Thread via GitHub


allisonwang-db commented on code in PR #46007:
URL: https://github.com/apache/spark/pull/46007#discussion_r1568042050


##
python/pyspark/sql/session.py:
##
@@ -1630,6 +1630,13 @@ def sql(
 ---
 :class:`DataFrame`
 
+Notes
+-
+In Spark Classic, a temporary view referenced in `spark.sql` is 
resolved immediately,

Review Comment:
   How about temp functions?



##
python/pyspark/sql/session.py:
##
@@ -1630,6 +1630,13 @@ def sql(
 ---
 :class:`DataFrame`
 
+Notes
+-
+In Spark Classic, a temporary view referenced in `spark.sql` is 
resolved immediately,
+while in Spark Connect it is lazily evaluated.

Review Comment:
   I think this note might be very confusing to users, as data frames in Spark 
are all lazily evaluated, right? Maybe we can say "it is lazily analyzed". 
   
   We should probably document this as a behavior change for Spark Connect. I 
am pretty sure there are other behavior changes. Also does this lazy analysis 
apply to persistent tables and views as well?
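   
   For concreteness, a minimal sketch of the difference under discussion 
(illustrative only; assumes an active `spark` session):
   
   ```py
   spark.range(1).createOrReplaceTempView("v")
   df = spark.sql("SELECT COUNT(*) AS c FROM v")  # Classic: "v" is analyzed here
   spark.range(10).createOrReplaceTempView("v")   # redefine the view before execution
   df.show()  # Classic: c = 1 (view captured eagerly); Connect: c = 10 (lazy analysis)
   ```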






Re: [PR] [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession [spark]

2024-04-16 Thread via GitHub


zhengruifeng closed pull request #46075: [SPARK-47868][CONNECT] Fix recursion 
limit error in SparkConnectPlanner and SparkSession
URL: https://github.com/apache/spark/pull/46075





Re: [PR] [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession [spark]

2024-04-16 Thread via GitHub


zhengruifeng commented on PR #46075:
URL: https://github.com/apache/spark/pull/46075#issuecomment-2060071663

   merged to master





[PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]

2024-04-16 Thread via GitHub


allisonwang-db opened a new pull request, #46089:
URL: https://github.com/apache/spark/pull/46089

   
   
   ### What changes were proposed in this pull request?
   
   This PR adds a new user guide for the Python data source API with a simple 
example. More examples (including streaming) will be added in the future.
   
   ### Why are the changes needed?
   
   To improve the documentation
   
   ### Does this PR introduce _any_ user-facing change?
   
   No 
   
   ### How was this patch tested?
   
   doctest
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No





Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]

2024-04-16 Thread via GitHub


zhengruifeng commented on PR #46088:
URL: https://github.com/apache/spark/pull/46088#issuecomment-2060067749

   The doc of `mapInArrow` is similar to that of `mapInPandas`; shall we refine 
the latter too?





Re: [PR] [SPARK-47618][CORE] Use `Magic Committer` for all S3 buckets by default [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #45740:
URL: https://github.com/apache/spark/pull/45740#issuecomment-2060044971

   Thank you for your feedback, @steveloughran. Yes, as you mentioned, this is 
blocked by exactly these two configurations:
   ```
   spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
   spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
   ```
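   
   For reference, these bindings pair with the S3A-side settings below; a 
sketch (in recent Hadoop releases the magic committer is typically enabled by 
default, so verify against the deployed version):
   ```
   spark.hadoop.fs.s3a.committer.name=magic
   spark.hadoop.fs.s3a.committer.magic.enabled=true
   ```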





Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]

2024-04-16 Thread via GitHub


WweiL commented on PR #46072:
URL: https://github.com/apache/spark/pull/46072#issuecomment-2060036826

   @HyukjinKwon Can you take a look? Thank you!





Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


anton5798 commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568012332


##
sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:
##
@@ -621,4 +621,14 @@ class DataFrameJoinSuite extends QueryTest
 checkAnswer(joined, Row("x", null, null))
 checkAnswer(joined.filter($"new".isNull), Row("x", null, null))
   }
+
+  test("SPARK-47810: replace equivalent expression to <=> in join condition") {

Review Comment:
   Let's also add a summary of the plan change (before/after) and the 
quantified performance benefit to the PR description.






Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


anton5798 commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568011225


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, 
IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR}
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> 
t2.id`
+ * in join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.transformWithPruning(

Review Comment:
   Even though this is a simple rule, we should add a flag guard here.
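   
   Until a dedicated flag lands, the generic escape hatch would be the 
optimizer's excluded-rules conf; a sketch, assuming the rule keeps this class 
name:
   
   ```py
   # Disable the new rewrite for a session without a dedicated flag:
   spark.conf.set(
       "spark.sql.optimizer.excludedRules",
       "org.apache.spark.sql.catalyst.optimizer.OptimizeJoinCondition")
   ```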






Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]

2024-04-16 Thread via GitHub


anton5798 commented on code in PR #45999:
URL: https://github.com/apache/spark/pull/45999#discussion_r1568005755


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala:
##
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, 
IsNull, Or, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.trees.TreePattern.JOIN
+
+/**
+ * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> 
t2.id`
+ * in join condition for better performance.
+ */
+object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = 
plan.transformWithPruning(
+_.containsPattern(JOIN), ruleId) {
+case join @ Join(left, right, joinType, condition, hint) if 
condition.nonEmpty =>
+  val predicates = condition.map(splitConjunctivePredicates).get
+  var replace = false
+  val newPredicates = predicates.map {
+case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2)))

Review Comment:
   If the target scenario just has `[l = r OR (isnull(l) and isnull(r))]` 
somewhere in the expression tree, then I think we don't need to 
`splitConjunctivePredicates`.
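   
   For illustration, the shape of the rewrite under discussion, sketched with 
hypothetical tables `t1` and `t2`:
   
   ```py
   # Before: null-safe semantics spelled out with OR/AND/IS NULL, which blocks
   # extraction of equi-join keys:
   spark.sql("""
       SELECT * FROM t1 JOIN t2
       ON t1.id = t2.id OR (t1.id IS NULL AND t2.id IS NULL)
   """)
   
   # After the rule: the equivalent null-safe equality, eligible for hash joins:
   spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id <=> t2.id")
   ```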






Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46087:
URL: https://github.com/apache/spark/pull/46087#issuecomment-2060008470

   I removed the SPARK-46205 test case that I had missed.





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on code in PR #46086:
URL: https://github.com/apache/spark/pull/46086#discussion_r1567997561


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -20,18 +20,18 @@ package org.apache.spark.sql.hive.client
 import java.io.{File, PrintStream}
 import java.lang.reflect.InvocationTargetException
 import java.net.{URL, URLClassLoader}
-import java.util
 
+import java.util

Review Comment:
   Thanks, now `./dev/scalastyle` should pass.






Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46087:
URL: https://github.com/apache/spark/pull/46087#issuecomment-2060007634

   Yes, there are other commits about the `compression` code and some neutral 
changes. I believe it will be okay, and the final goal is to bring it back again.





[PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]

2024-04-16 Thread via GitHub


xinrong-meng opened a new pull request, #46088:
URL: https://github.com/apache/spark/pull/46088

   ### What changes were proposed in this pull request?
   Improve docstring of mapInArrow:
   
   - "using a Python native function that takes and outputs a PyArrow's 
RecordBatch" is confusing cause the function takes and outputs "ITERATOR of 
RecordBatchs" instead.
   - "All columns are passed together as an iterator of pyarrow.RecordBatchs" 
easily mislead users to think the entire DataFrame will be passed together, "a 
batch of rows" is used instead.
   
   ### Why are the changes needed?
   More accurate and clear docstring.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Doc change only.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.





Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46087:
URL: https://github.com/apache/spark/pull/46087#issuecomment-2060004457

   Thank you so much for the swift help. I'll make sure that all CIs pass.





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on code in PR #46086:
URL: https://github.com/apache/spark/pull/46086#discussion_r1567994515


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala:
##
@@ -20,18 +20,18 @@ package org.apache.spark.sql.hive.client
 import java.io.{File, PrintStream}
 import java.lang.reflect.InvocationTargetException
 import java.net.{URL, URLClassLoader}
-import java.util
 
+import java.util

Review Comment:
   ? This change still looks weird.






Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


viirya commented on PR #46087:
URL: https://github.com/apache/spark/pull/46087#issuecomment-2060002347

   Pending CI. Thanks @dongjoon-hyun 





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on code in PR #46086:
URL: https://github.com/apache/spark/pull/46086#discussion_r1567992633


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -31,10 +32,10 @@ import 
org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.DDL_TIME
 import org.apache.hadoop.hive.ql.metadata.HiveException
 import org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_FORMAT
 import org.apache.thrift.TException
-

Review Comment:
   Thanks, it is fixed now.






Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on PR #46087:
URL: https://github.com/apache/spark/pull/46087#issuecomment-2059998699

   Sorry, but could you review this reverting PR, @viirya? While I was running 
this, I found my mistake.





[PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun opened a new pull request, #46087:
URL: https://github.com/apache/spark/pull/46087

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


dongjoon-hyun commented on code in PR #46086:
URL: https://github.com/apache/spark/pull/46086#discussion_r1567987357


##
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:
##
@@ -31,10 +32,10 @@ import 
org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.DDL_TIME
 import org.apache.hadoop.hive.ql.metadata.HiveException
 import org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_FORMAT
 import org.apache.thrift.TException
-

Review Comment:
   ? Could you double-check your linter, @gengliangwang ?






Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]

2024-04-16 Thread via GitHub


neilramaswamy commented on PR #46065:
URL: https://github.com/apache/spark/pull/46065#issuecomment-2059969162

   @dongjoon-hyun The numbers are still approximately the same (I just updated 
with the latest results); a few are better. Seems safe to merge when CI passes. 
Thanks!





Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-16 Thread via GitHub


ericm-db commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1567940592


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala:
##
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import 
org.apache.spark.sql.execution.streaming.TransformWithStateKeyValueRowSchema.{COMPOSITE_KEY_ROW_SCHEMA,
 VALUE_ROW_SCHEMA_WITH_TTL}
+import 
org.apache.spark.sql.execution.streaming.state.{PrefixKeyScanStateEncoderSpec, 
StateStore, StateStoreErrors}
+import org.apache.spark.sql.streaming.{MapState, TTLConfig}
+import org.apache.spark.util.NextIterator
+
+
+class MapStateImplWithTTL[K, V](
+  store: StateStore,
+  stateName: String,
+  keyExprEnc: ExpressionEncoder[Any],
+  userKeyEnc: Encoder[K],
+  valEncoder: Encoder[V],
+  ttlConfig: TTLConfig,
+  batchTimestampMs: Long) extends CompositeKeyTTLStateImpl(stateName, store, 
batchTimestampMs)
+  with MapState[K, V] with Logging {
+
+  private val keySerializer = keyExprEnc.createSerializer()
+  private val stateTypesEncoder = new CompositeKeyStateEncoder(
+keySerializer, userKeyEnc, valEncoder, COMPOSITE_KEY_ROW_SCHEMA, 
stateName, hasTtl = true)
+
+  private val ttlExpirationMs =
+StateTTL.calculateExpirationTimeForDuration(ttlConfig.ttlDuration, 
batchTimestampMs)
+
+  initialize()
+
+  private def initialize(): Unit = {
+store.createColFamilyIfAbsent(stateName, COMPOSITE_KEY_ROW_SCHEMA, 
VALUE_ROW_SCHEMA_WITH_TTL,
+  PrefixKeyScanStateEncoderSpec(COMPOSITE_KEY_ROW_SCHEMA, 1))
+  }
+
+  /** Whether state exists or not. */
+  override def exists(): Boolean = {
+iterator().nonEmpty
+  }
+
+  /** Get the state value if it exists */
+  override def getValue(key: K): V = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+val retRow = store.get(encodedCompositeKey, stateName)
+
+if (retRow != null) {
+  val resState = stateTypesEncoder.decodeValue(retRow)
+
+  if (!stateTypesEncoder.isExpired(retRow, batchTimestampMs)) {
+resState
+  } else {
+null.asInstanceOf[V]
+  }
+} else {
+  null.asInstanceOf[V]
+}
+  }
+
+  /** Check if the user key is contained in the map */
+  override def containsKey(key: K): Boolean = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+getValue(key) != null
+  }
+
+  /** Update value for given user key */
+  override def updateValue(key: K, value: V): Unit = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+StateStoreErrors.requireNonNullStateValue(value, stateName)
+val encodedValue = stateTypesEncoder.encodeValue(value, ttlExpirationMs)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+store.put(encodedCompositeKey, encodedValue, stateName)
+val serializedGroupingKey = stateTypesEncoder.serializeGroupingKey()
+val serializedUserKey = stateTypesEncoder.serializeUserKey(key)
+upsertTTLForStateKey(ttlExpirationMs, serializedGroupingKey, 
serializedUserKey)
+  }
+
+  /** Get the map associated with grouping key */
+  override def iterator(): Iterator[(K, V)] = {
+val encodedGroupingKey = stateTypesEncoder.encodeGroupingKey()
+val unsafeRowPairIterator = store.prefixScan(encodedGroupingKey, stateName)
+new NextIterator[(K, V)] {
+  override protected def getNext(): (K, V) = {
+val iter = unsafeRowPairIterator.dropWhile { rowPair =>
+  stateTypesEncoder.isExpired(rowPair.value, batchTimestampMs)
+}
+if (iter.hasNext) {
+  val currentRowPair = iter.next()
+  val key = stateTypesEncoder.decodeCompositeKey(currentRowPair.key)
+  val value = stateTypesEncoder.decodeValue(currentRowPair.value)
+  (key, value)
+} else {
+  finished = true
+  null.asInstanceOf[(K, V)]
+}
+  }
+
+  override protected def close(): Unit = {}
+}
+  }
+
+  

Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #45923:
URL: https://github.com/apache/spark/pull/45923#issuecomment-2059908947

   Thanks, merging to master





Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang closed pull request #45923: [SPARK-47590][SQL] Hive-thriftserver: 
Migrate logWarn with variables to structured logging framework
URL: https://github.com/apache/spark/pull/45923





Re: [PR] [SHUFFLE] [WIP] Prototype: store shuffle file on external storage like S3 [spark]

2024-04-16 Thread via GitHub


steveloughran commented on PR #34864:
URL: https://github.com/apache/spark/pull/34864#issuecomment-2059907838

   @michaelbilow Hadoop S3A is on the v2 SDK; the com.amazonaws classes are not on 
the classpath, and Amazon is slowly winding down support for them. You cannot, for 
example, use the lower-latency S3 Express stores with it.
   
   Like I say: I think you would be better off using the Hadoop file system APIs 
to talk to S3. If there are aspects of S3 storage that aren't available through 
the API, or are only available very inefficiently due to the effort to preserve 
the POSIX metaphor, then let's fix the API so that other stores can offer the 
same features and other apps can pick them up.
   
   For example, here's our ongoing delete API for Iceberg and other 
manifest-based tables:
   https://github.com/apache/hadoop/pull/6726
   It maps to S3 bulk delete calls, but there's scope to add it to other stores 
(we now actually want to add it as a page-size == 1 option for all filesystems, 
as it simplifies Iceberg integration).





Re: [PR] [SPARK-47618][CORE] Use `Magic Committer` for all S3 buckets by default [spark]

2024-04-16 Thread via GitHub


steveloughran commented on PR #45740:
URL: https://github.com/apache/spark/pull/45740#issuecomment-2059899891

   I have no problems with the PR; we have made it the default in our releases.
   
   This could be a good time to revisit the "why is there a separate 
PathOutputCommitter" question; originally it existed because Spark was built 
against releases without the new PathOutputCommitter interface. This no longer 
holds: could anything needed from it be pulled up into the main committer?
   
   One recurrent trouble spot we have with committing work is Parquet; it 
requires all committers to be a subclass of ParquetOutputCommitter, hence the 
(ugly, brittle) wrapping stuff. Life would be a lot easier if Parquet accepted 
any PathOutputCommitter; it would just skip the schema writing.
   
   Of course, we then come up against the fact that Parquet still wants to 
build against Hadoop 2.8. Everyone needs to move on, especially as Hadoop's 
Java 11+ support is 3.2.x+ only.





Re: [PR] [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution [spark]

2024-04-16 Thread via GitHub


xupefei commented on PR #45748:
URL: https://github.com/apache/spark/pull/45748#issuecomment-2059899197

   > @xupefei could you provide more details in the PR description? For 
example, what is the difference with/without `WITH SCHEMA EVOLUTION`
   
   Hi @gengliangwang, I added to the PR description as you advised. Please have 
a look!
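   
   For readers of the thread, the added syntax looks roughly as follows (a 
sketch with hypothetical `target`/`source` tables; see the PR description for 
the exact semantics):
   
   ```py
   spark.sql("""
       MERGE WITH SCHEMA EVOLUTION INTO target t
       USING source s
       ON t.id = s.id
       WHEN MATCHED THEN UPDATE SET *
       WHEN NOT MATCHED THEN INSERT *
   """)
   ```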





Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]

2024-04-16 Thread via GitHub


ericm-db commented on code in PR #45991:
URL: https://github.com/apache/spark/pull/45991#discussion_r1567908966


##
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala:
##
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import 
org.apache.spark.sql.execution.streaming.TransformWithStateKeyValueRowSchema.{COMPOSITE_KEY_ROW_SCHEMA,
 VALUE_ROW_SCHEMA_WITH_TTL}
+import 
org.apache.spark.sql.execution.streaming.state.{PrefixKeyScanStateEncoderSpec, 
StateStore, StateStoreErrors}
+import org.apache.spark.sql.streaming.{MapState, TTLConfig}
+import org.apache.spark.util.NextIterator
+
+
+class MapStateImplWithTTL[K, V](
+  store: StateStore,
+  stateName: String,
+  keyExprEnc: ExpressionEncoder[Any],
+  userKeyEnc: Encoder[K],
+  valEncoder: Encoder[V],
+  ttlConfig: TTLConfig,
+  batchTimestampMs: Long) extends CompositeKeyTTLStateImpl(stateName, store, 
batchTimestampMs)
+  with MapState[K, V] with Logging {
+
+  private val keySerializer = keyExprEnc.createSerializer()
+  private val stateTypesEncoder = new CompositeKeyStateEncoder(
+keySerializer, userKeyEnc, valEncoder, COMPOSITE_KEY_ROW_SCHEMA, 
stateName, hasTtl = true)
+
+  private val ttlExpirationMs =
+StateTTL.calculateExpirationTimeForDuration(ttlConfig.ttlDuration, 
batchTimestampMs)
+
+  initialize()
+
+  private def initialize(): Unit = {
+store.createColFamilyIfAbsent(stateName, COMPOSITE_KEY_ROW_SCHEMA, 
VALUE_ROW_SCHEMA_WITH_TTL,
+  PrefixKeyScanStateEncoderSpec(COMPOSITE_KEY_ROW_SCHEMA, 1))
+  }
+
+  /** Whether state exists or not. */
+  override def exists(): Boolean = {
+iterator().nonEmpty
+  }
+
+  /** Get the state value if it exists */
+  override def getValue(key: K): V = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+val retRow = store.get(encodedCompositeKey, stateName)
+
+if (retRow != null) {
+  val resState = stateTypesEncoder.decodeValue(retRow)
+
+  if (!stateTypesEncoder.isExpired(retRow, batchTimestampMs)) {
+resState
+  } else {
+null.asInstanceOf[V]
+  }
+} else {
+  null.asInstanceOf[V]
+}
+  }
+
+  /** Check if the user key is contained in the map */
+  override def containsKey(key: K): Boolean = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+getValue(key) != null
+  }
+
+  /** Update value for given user key */
+  override def updateValue(key: K, value: V): Unit = {
+StateStoreErrors.requireNonNullStateValue(key, stateName)
+StateStoreErrors.requireNonNullStateValue(value, stateName)
+val encodedValue = stateTypesEncoder.encodeValue(value, ttlExpirationMs)
+val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key)
+store.put(encodedCompositeKey, encodedValue, stateName)
+val serializedGroupingKey = stateTypesEncoder.serializeGroupingKey()
+val serializedUserKey = stateTypesEncoder.serializeUserKey(key)
+upsertTTLForStateKey(ttlExpirationMs, serializedGroupingKey, 
serializedUserKey)
+  }
+
+  /** Get the map associated with grouping key */
+  override def iterator(): Iterator[(K, V)] = {
+val encodedGroupingKey = stateTypesEncoder.encodeGroupingKey()
+val unsafeRowPairIterator = store.prefixScan(encodedGroupingKey, stateName)
+new NextIterator[(K, V)] {
+  override protected def getNext(): (K, V) = {
+val iter = unsafeRowPairIterator.dropWhile { rowPair =>
+  stateTypesEncoder.isExpired(rowPair.value, batchTimestampMs)
+}
+if (iter.hasNext) {
+  val currentRowPair = iter.next()
+  val key = stateTypesEncoder.decodeCompositeKey(currentRowPair.key)
+  val value = stateTypesEncoder.decodeValue(currentRowPair.value)
+  (key, value)
+} else {
+  finished = true
+  null.asInstanceOf[(K, V)]
+}
+  }
+
+  override protected def close(): Unit = {}
+}
+  }
+
+  

Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #46086:
URL: https://github.com/apache/spark/pull/46086#issuecomment-2059874634

   cc @panbingkun @itholic 





[PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang opened a new pull request, #46086:
URL: https://github.com/apache/spark/pull/46086

   
   
   ### What changes were proposed in this pull request?
   
   Migrate `logInfo` calls with variables in the Hive module to the structured logging framework.
   
   
   
   ### Why are the changes needed?
   
   To enhance Apache Spark's logging system by implementing structured logging.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   
   GA tests
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
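
   For context, the migration pattern such PRs apply generally looks like the sketch below. This is a hedged illustration rather than code from this PR: the `log"..."` interpolator and `MDC` tagging come from Spark's structured logging framework, and the class name and log key here are assumed for illustration.
   
   ```scala
   import org.apache.spark.internal.{Logging, LogKeys, MDC}
   
   // Illustrative class, not from this PR.
   class HiveVersionLogger extends Logging {
     def logVersion(version: String): Unit = {
       // Before: plain string interpolation; the variable is lost in free text.
       // logInfo(s"Connected to Hive metastore, version: $version")
   
       // After: the variable is tagged with a log key (key name assumed here)
       // so downstream consumers can query it as a structured field.
       logInfo(log"Connected to Hive metastore, version: ${MDC(LogKeys.HIVE_METASTORE_VERSION, version)}")
     }
   }
   ```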


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang closed pull request #46022: [SPARK-47594] Connector module: 
Migrate logInfo with variables to structured logging framework
URL: https://github.com/apache/spark/pull/46022


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #46022:
URL: https://github.com/apache/spark/pull/46022#issuecomment-2059867235

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]

2024-04-16 Thread via GitHub


ueshin commented on code in PR #46063:
URL: https://github.com/apache/spark/pull/46063#discussion_r1567889537


##
sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala:
##
@@ -160,6 +160,8 @@ case class DataFrameQueryContext(
   val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("")
   val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("")
 
+  PySparkCurrentOrigin.clear()

Review Comment:
   If we clear `PySparkCurrentOrigin` right after `get`, it seems like `lit` here: https://github.com/apache/spark/blob/ccb87fb26481acecbd75682ddc8fa9e5dd52e8cb/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L171 consumes the PySpark context in this case. We may want to clear it after we call the JVM method.
   
   ```py
   _capture_call_site(func.__name__)
   
   try:
       return func(*args, **kwargs)
   finally:
       pyspark_origin.clear()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47838][BUILD] Upgrade rocksdbjni to 8.11.4 [spark]

2024-04-16 Thread via GitHub


neilramaswamy commented on PR #46065:
URL: https://github.com/apache/spark/pull/46065#issuecomment-2059736653

   The JDK 21 results are _slightly_ slower than they were before, which is odd since nothing really changed between these releases. I'll run the benchmarks again to confirm whether this is just variance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-16 Thread via GitHub


CTCC1 commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1566772319


##
python/pyspark/sql/connect/functions/builtin.py:
##
@@ -2476,8 +2476,26 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", int]) -> Column:
 repeat.__doc__ = pysparkfuncs.repeat.__doc__
 
 
-def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
-    return _invoke_function("split", _to_col(str), lit(pattern), lit(limit))
+def split(
+    str: "ColumnOrName",
+    pattern: Union[Column, str],
+    limit: Union["ColumnOrName", int] = -1,
+) -> Column:
+    # work around shadowing of str in the input variable name
+    from builtins import str as py_str
+
+    if isinstance(pattern, py_str):
+        _pattern = lit(pattern)
+    elif isinstance(pattern, Column):
+        _pattern = pattern
+    else:
+        raise PySparkTypeError(
+            error_class="NOT_COLUMN_OR_STR",
+            message_parameters={"arg_name": "pattern", "arg_type": type(pattern).__name__},
+        )
+
+    limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
+    return _invoke_function("split", _to_col(str), _pattern, limit)

Review Comment:
   Thanks for the suggestion, this is simpler for sure! The only concern is that we will not raise `PySparkTypeError` if `pattern` is passed in with a type other than `Column` or `str`; it will instead form an `UnresolvedFunction`. Is raising such an error early a requirement for Connect?
   
   For now I simplified the code but kept the error raise. Let me know if we want to skip raising the error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

2024-04-16 Thread via GitHub


CTCC1 commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1567800324


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala:
##
@@ -4197,6 +4197,20 @@ object functions {
    */
   def split(str: Column, pattern: String): Column = Column.fn("split", str, lit(pattern))
 
+  /**
+   * Splits str around matches of the given pattern.
+   *
+   * @param str
+   *   a string expression to split
+   * @param pattern
+   *   a column of string representing a regular expression. The regex string should be a Java
+   *   regular expression.
+   *
+   * @group string_funcs
+   * @since 4.0.0
+   */
+  def split(str: Column, pattern: Column): Column = Column.fn("split", str, pattern)

Review Comment:
   Thanks for the pointer, fixed.
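   
   As a usage note on the new overload: accepting a `Column` pattern lets each row be split by its own regex, which the existing `String` overload cannot express. A hedged sketch, assuming a `SparkSession` named `spark` is in scope (e.g. in spark-shell):
   
   ```scala
   import org.apache.spark.sql.functions.{col, split}
   import spark.implicits._
   
   // Each row supplies its own delimiter pattern.
   val df = Seq(("a,b,c", ","), ("a;b;c", ";")).toDF("text", "sep")
   
   // Uses the split(str: Column, pattern: Column) overload added in this PR.
   df.select(split(col("text"), col("sep")).as("parts")).show()
   // Expected: both rows yield ["a", "b", "c"].
   ```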



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]

2024-04-16 Thread via GitHub


kelvinjian-db commented on code in PR #46034:
URL: https://github.com/apache/spark/pull/46034#discussion_r1567767274


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala:
##
@@ -29,25 +29,29 @@ import org.apache.spark.sql.types.IntegerType
 class RewriteWithExpressionSuite extends PlanTest {
 
   object Optimizer extends RuleExecutor[LogicalPlan] {
-    val batches = Batch("Rewrite With expression", Once, RewriteWithExpression) :: Nil
+    val batches = Batch("Rewrite With expression", Once,
+      PullOutGroupingExpressions,
+      RewriteWithExpression) :: Nil
   }
 
   private val testRelation = LocalRelation($"a".int, $"b".int)
   private val testRelation2 = LocalRelation($"x".int, $"y".int)
 
   test("simple common expression") {
     val a = testRelation.output.head
-    val commonExprDef = CommonExpressionDef(a)
-    val ref = new CommonExpressionRef(commonExprDef)
-    val plan = testRelation.select(With(ref + ref, Seq(commonExprDef)).as("col"))
+    val expr = With.create((a, 0)) { case Seq(ref) =>

Review Comment:
   I ended up changing the tests to manually extract the common expression IDs, so we don't need `With.create` anymore.
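   
   Concretely, that manual construction looks roughly like the removed lines in the diff above. A hedged sketch, assuming the suite's existing fixtures (`testRelation`, the Catalyst DSL) and that `CommonExpressionDef` exposes its id:
   
   ```scala
   // Build the definition and reference explicitly instead of via With.create,
   // keeping a handle on the definition so its id is directly available.
   val a = testRelation.output.head
   val commonExprDef = CommonExpressionDef(a)
   val commonExprId = commonExprDef.id // assumed accessor, for id-sensitive assertions
   val ref = new CommonExpressionRef(commonExprDef)
   val plan = testRelation.select(With(ref + ref, Seq(commonExprDef)).as("col"))
   ```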



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution [spark]

2024-04-16 Thread via GitHub


gengliangwang commented on PR #45748:
URL: https://github.com/apache/spark/pull/45748#issuecomment-2059645090

   @xupefei could you provide more details in the PR description? For example, what is the difference with and without `WITH SCHEMA EVOLUTION`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]

2024-04-16 Thread via GitHub


viirya commented on PR #3:
URL: 
https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2059585966

   Thank you @jiangzho @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


