Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]
cloud-fan commented on PR #46034: URL: https://github.com/apache/spark/pull/46034#issuecomment-2060437664 The test fails: `org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]
cloud-fan commented on code in PR #46034: URL: https://github.com/apache/spark/pull/46034#discussion_r1568259745 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpression.scala: ## @@ -21,36 +21,68 @@ import scala.collection.mutable import org.apache.spark.SparkException import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, PlanHelper, Project} +import org.apache.spark.sql.catalyst.planning.PhysicalAggregation +import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, PlanHelper, Project} import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.catalyst.trees.TreePattern.{COMMON_EXPR_REF, WITH_EXPRESSION} /** * Rewrites the `With` expressions by adding a `Project` to pre-evaluate the common expressions, or * just inline them if they are cheap. * + * Since this rule can introduce new `Project` operators, it is advised to run [[CollapseProject]] + * after this rule. + * * Note: For now we only use `With` in a few `RuntimeReplaceable` expressions. If we expand its * usage, we should support aggregate/window functions as well. */ object RewriteWithExpression extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = { - plan.transformDownWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) { + plan.transformUpWithSubqueriesAndPruning(_.containsPattern(WITH_EXPRESSION)) { + // For aggregates, separate the computation of the aggregations themselves from the final + // result by moving the final result computation into a projection above it. This prevents + // this rule from producing an invalid Aggregate operator. + case p @ PhysicalAggregation( + groupingExpressions, aggregateExpressions, resultExpressions, child) + if p.expressions.exists(_.containsPattern(WITH_EXPRESSION)) => +// There should not be dangling common expression references in the aggregate expressions. +// This can happen if a With is created with an aggregate function in its child. +assert(!aggregateExpressions.exists(ae => Review Comment: Shall we do the assert in the constructor of `With`, to fail earlier? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-46812][CONNECT][PYTHON][FOLLOW-UP] Add pyspark.pyspark.sql.connect.resource into PyPi packaging [spark]
HyukjinKwon opened a new pull request, #46094: URL: https://github.com/apache/spark/pull/46094 ### What changes were proposed in this pull request? This PR proposes to add `pyspark.pyspark.sql.connect.resource` into PyPi packaging. ### Why are the changes needed? In order for PyPI end users to download PySpark and leverage this feature. ### Does this PR introduce _any_ user-facing change? No, the main change has not been released. ### How was this patch tested? Being tested at https://github.com/apache/spark/pull/46090 ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47591][SQL] Hive-thriftserver: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #45926: URL: https://github.com/apache/spark/pull/45926#issuecomment-2060387686 @itholic Please resolve the conflict so that I can merge this one. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang closed pull request #46086: [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework URL: https://github.com/apache/spark/pull/46086 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46086: URL: https://github.com/apache/spark/pull/46086#issuecomment-2060385871 @dongjoon-hyun @HyukjinKwon Thanks for the review. Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework [spark]
panbingkun commented on code in PR #46057: URL: https://github.com/apache/spark/pull/46057#discussion_r1568232939 ## common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala: ## @@ -172,12 +228,16 @@ object LogKey extends Enumeration { val TOPIC_PARTITION = Value val TOTAL_EFFECTIVE_TIME = Value val TOTAL_TIME = Value - val UNSUPPORTED_EXPRESSION = Value Review Comment: Let's unify it as an abbreviation `EXPRESSION` -> `EXPR` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]
bluzy commented on PR #46025: URL: https://github.com/apache/spark/pull/46025#issuecomment-2060354963 @dongjoon-hyun @mridulm I think incorrect inprogress file would be deleted on cleaner's schedule, isn't it? I concen that many spark streaming application can lives forever until needs to be upgraded. It's difficult to set `spark.history.fs.cleaner.maxAge` value while running streaming job. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47882][SQL] createTableColumnTypes need to be mapped to database types instead of using directly [spark]
yaooqinn opened a new pull request, #46093: URL: https://github.com/apache/spark/pull/46093 … ### What changes were proposed in this pull request? createTableColumnTypes contains Spark SQL data type definitions. The underlying database might not recognize them, boolean for Oracle(v < 23c). ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn closed pull request #46092: [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle URL: https://github.com/apache/spark/pull/46092 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]
yaooqinn commented on PR #46091: URL: https://github.com/apache/spark/pull/46091#issuecomment-2060344927 Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn commented on PR #46092: URL: https://github.com/apache/spark/pull/46092#issuecomment-2060346342 Merged to master, Thank you @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]
yaooqinn closed pull request #46091: [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping URL: https://github.com/apache/spark/pull/46091 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568210917 ## python/pyspark/errors/utils.py: ## @@ -16,9 +16,14 @@ # import re -from typing import Dict, Match - +import functools +import inspect +from typing import Any, Callable, Dict, Match, TypeVar, Type from pyspark.errors.error_classes import ERROR_CLASSES_MAP +from py4j.java_gateway import JavaClass Review Comment: Sure, just updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47850][SQL] Support `spark.sql.hive.convertInsertingUnpartitionedTable` [spark]
pan3793 commented on PR #46052: URL: https://github.com/apache/spark/pull/46052#issuecomment-2060319567 cc @ulysses-you who made refactor on this part -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL][DOCS] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn commented on code in PR #46092: URL: https://github.com/apache/spark/pull/46092#discussion_r1568183291 ## docs/sql-data-sources-jdbc.md: ## @@ -1335,3 +1335,109 @@ as the activated JDBC Driver. + +### Mapping Spark SQL Data Types to Oracle + +The below table describes the data type conversions from Spark SQL Data Types to Oracle data types, +when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with +the Oracle JDBC as the activated JDBC Driver. + + + + + Spark SQL Data Type + Oracle Data Type + Remarks + + + + + BooleanType + NUMBER(1, 0) + BooleanType maps to NUMBER(1, 0) as BOOLEAN is introduced since Oracle Release 23c Review Comment: Yes, previous versions didn't have BOOLEAN -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]
dongjoon-hyun commented on PR #46025: URL: https://github.com/apache/spark/pull/46025#issuecomment-2060295412 Yes, Mridul's comment is correct. I believe the AS-IS behavior is robust and safe and intended one instead of a bug. WDTY, @bluzy ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
dongjoon-hyun commented on code in PR #46092: URL: https://github.com/apache/spark/pull/46092#discussion_r1568177905 ## docs/sql-data-sources-jdbc.md: ## @@ -1335,3 +1335,109 @@ as the activated JDBC Driver. + +### Mapping Spark SQL Data Types to Oracle + +The below table describes the data type conversions from Spark SQL Data Types to Oracle data types, +when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with +the Oracle JDBC as the activated JDBC Driver. + + + + + Spark SQL Data Type + Oracle Data Type + Remarks + + + + + BooleanType + NUMBER(1, 0) + BooleanType maps to NUMBER(1, 0) as BOOLEAN is introduced since Oracle Release 23c Review Comment: Is it because `Oracle Database 23c` is introduced on April 2023? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn commented on code in PR #46092: URL: https://github.com/apache/spark/pull/46092#discussion_r1568168654 ## docs/sql-data-sources-jdbc.md: ## @@ -1335,3 +1335,109 @@ as the activated JDBC Driver. + +### Mapping Spark SQL Data Types to Oracle + +The below table describes the data type conversions from Spark SQL Data Types to Oracle data types, +when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with +the Oracle JDBC as the activated JDBC Driver. + + + + + Spark SQL Data Type + Oracle Data Type + Remarks + + + + + BooleanType + NUMBER(1, 0) + BooleanType maps to NUMBER(1, 0) as BOOLEAN is introduced since Oracle Release 23c + + + ByteType + NUMBER(3) + + + + ShortType + NUMBER(5) + + + + IntegerType + NUMBER(10) + + + + LongType + NUMBER(19) + + + + FloatType + NUMBER(19, 4) + + + + DoubleType + NUMBER(19, 4) + + + + DecimalType(p, s) + NUMBER(p,s) + + + + DateType + DATE + + + + TimestampType + TIMESTAMP WITH LOCAL TIME ZONE + + + + TimestampNTZType + TIMESTAMP + + + + StringType + VARCHAR2(255) + For historical reason, a string value has maximum 255 characters + + + BinaryType + BLOB + + + + CharType(n) + CHAR(n) + + + + VarcharType(n) + VARCHAR2(n) Review Comment: It's VARCHAR2 after #46091 It does not make any difference to the users, so I didn't add any remarks as less is more. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn commented on code in PR #46092: URL: https://github.com/apache/spark/pull/46092#discussion_r1568169069 ## docs/sql-data-sources-jdbc.md: ## @@ -1335,3 +1335,109 @@ as the activated JDBC Driver. + +### Mapping Spark SQL Data Types to Oracle + +The below table describes the data type conversions from Spark SQL Data Types to Oracle data types, +when creating, altering, or writing data to an Oracle table using the built-in jdbc data source with +the Oracle JDBC as the activated JDBC Driver. + + + + + Spark SQL Data Type + Oracle Data Type + Remarks + + + + + BooleanType + NUMBER(1, 0) + BooleanType maps to NUMBER(1, 0) as BOOLEAN is introduced since Oracle Release 23c Review Comment: we might introduce a configuration for users to map BOOLEAN instead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]
yaooqinn commented on PR #46091: URL: https://github.com/apache/spark/pull/46091#issuecomment-2060277921 Thank you @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47880][SQL] Oracle: Document Mapping Spark SQL Data Types to Oracle [spark]
yaooqinn opened a new pull request, #46092: URL: https://github.com/apache/spark/pull/46092 ### What changes were proposed in this pull request? Documents Mapping Spark SQL Data Types to Oracle ### Why are the changes needed? documentation improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? doc build ![image](https://github.com/apache/spark/assets/8326978/3bc7816a-f887-4f8c-95c0-15ec2c4ac7fb) ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
HyukjinKwon commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568166135 ## python/pyspark/errors/utils.py: ## @@ -16,9 +16,14 @@ # import re -from typing import Dict, Match - +import functools +import inspect +from typing import Any, Callable, Dict, Match, TypeVar, Type from pyspark.errors.error_classes import ERROR_CLASSES_MAP +from py4j.java_gateway import JavaClass Review Comment: Can we hide this under `typing.TYPE_CHECKING`? For pure Python library package, we should avoid this whenever possible. Otherwise, the scheduled job would fail. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
HyukjinKwon commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568165854 ## python/pyspark/errors/utils.py: ## @@ -119,3 +124,68 @@ def get_message_template(self, error_class: str) -> str: message_template = main_message_template + " " + sub_message_template return message_template + + +def _capture_call_site(pyspark_origin: JavaClass, fragment: str) -> None: +""" +Capture the call site information including file name, line number, and function name. +This function updates the thread-local storage from JVM side (PySparkCurrentOrigin) +with the current call site information when a PySpark API function is called. + +Parameters +-- +pyspark_origin : JavaClass Review Comment: ```suggestion pyspark_origin : py4j.JavaClass ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568162931 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self): pyspark_fragment="eqNullSafe", ) -# DataFrameQueryContext with pysparkLoggingInfo - and -with self.assertRaises(AnalysisException) as pe: -df.withColumn("and_invalid_type", df.id & "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id AND string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="and", -) - -# DataFrameQueryContext with pysparkLoggingInfo - or -with self.assertRaises(AnalysisException) as pe: -df.withColumn("or_invalid_type", df.id | "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id OR string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="or", -) Review Comment: In short, there was a change in the error message in the previous PR, but the purpose of this work is to only provide additional error context to the existing error message, so it was actually an unexpected change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]
dongjoon-hyun commented on PR #46065: URL: https://github.com/apache/spark/pull/46065#issuecomment-2060267298 Merged to master for Apache Spark 4.0.0. Thank YOU for the contribution, @neilramaswamy . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]
dongjoon-hyun closed pull request #46065: [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 URL: https://github.com/apache/spark/pull/46065 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568157339 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self): pyspark_fragment="eqNullSafe", ) -# DataFrameQueryContext with pysparkLoggingInfo - and -with self.assertRaises(AnalysisException) as pe: -df.withColumn("and_invalid_type", df.id & "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id AND string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="and", -) - -# DataFrameQueryContext with pysparkLoggingInfo - or -with self.assertRaises(AnalysisException) as pe: -df.withColumn("or_invalid_type", df.id | "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id OR string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="or", -) Review Comment: In the previous PR, the error message was changed because we directly called the JVM's "fn" function. **Before the previous PR** ``` >>> df.withColumn("or_invalid_type", df.id | "string").collect() Traceback (most recent call last): py4j.protocol.Py4JError: An error occurred while calling o40.or. Trace: py4j.Py4JException: Method or([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:329) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.base/java.lang.Thread.run(Thread.java:840) ``` **After the previous PR** ``` >>> df.withColumn("or_invalid_type", df.id | "string").collect() Traceback (most recent call last): pyspark.errors.exceptions.captured.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(id OR string)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("BIGINT" and "STRING"). SQLSTATE: 42K09; 'Project [id#0L, (id#0L OR string) AS or_invalid_type#2] +- Range (0, 10, step=1, splits=Some(12)) ``` We changed the approach which is not calling "fn" directly from current PR, so the error message has returned to its original state. Additionally, JVM also doesn't have `DataFrameQueryContext` for "and", "or" and "not" functions which is aligning with current behavior. I guess such an exception for `and`, `or`, `not` is caused by: https://github.com/apache/spark/blob/6c827c10dc15e03178277a415c0e26e2d9d3a2f9/python/pyspark/sql/column.py#L402-L408 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47765][SQL] Add SET COLLATION to parser rules [spark]
cloud-fan commented on PR #45946: URL: https://github.com/apache/spark/pull/45946#issuecomment-2060252622 Shall we fail this command if the string collation feature flag is turned off? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47879][SQL] Oracle: Use VARCHAR2 instead of VARCHAR for VarcharType mapping [spark]
yaooqinn opened a new pull request, #46091: URL: https://github.com/apache/spark/pull/46091 ### What changes were proposed in this pull request? Use VARCHAR2 instead of VARCHAR for VarcharType mapping on the write-side. VARCHAR is a synonym of VARCHAR2 but it's unsteady in future Oracle releases. ### Why are the changes needed? Oracle official documentation says: > Do not use the VARCHAR data type. Use the VARCHAR2 data type instead. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE [spark]
yaooqinn commented on PR #46080: URL: https://github.com/apache/spark/pull/46080#issuecomment-2060233886 Thank you very much as always @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
ueshin commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568131652 ## python/pyspark/sql/tests/test_dataframe.py: ## @@ -1011,36 +1011,6 @@ def test_dataframe_error_context(self): pyspark_fragment="eqNullSafe", ) -# DataFrameQueryContext with pysparkLoggingInfo - and -with self.assertRaises(AnalysisException) as pe: -df.withColumn("and_invalid_type", df.id & "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id AND string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="and", -) - -# DataFrameQueryContext with pysparkLoggingInfo - or -with self.assertRaises(AnalysisException) as pe: -df.withColumn("or_invalid_type", df.id | "string").collect() -self.check_error( -exception=pe.exception, -error_class="DATATYPE_MISMATCH.BINARY_OP_WRONG_TYPE", -message_parameters={ -"inputType": '"BOOLEAN"', -"actualDataType": '"BIGINT"', -"sqlExpr": '"(id OR string)"', -}, -query_context_type=QueryContextType.DataFrame, -pyspark_fragment="or", -) Review Comment: I'm wondering why we remove these tests? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]
neilramaswamy commented on PR #46065: URL: https://github.com/apache/spark/pull/46065#issuecomment-2060226204 @dongjoon-hyun, should be ready to merge now. Appreciate your feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568119580 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: Yeah, `PySparkCurrentOrigin` should be cleared on the Python side to ensure consistent operation. It is working properly now. Thanks, @ueshin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47870][SQL] Optimize predicate after push extra predicate through join [spark]
zml1206 commented on PR #46085: URL: https://github.com/apache/spark/pull/46085#issuecomment-2060202987 cc @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47870][SQL] Optimize predicate after push extra predicate through join [spark]
zml1206 commented on code in PR #46085: URL: https://github.com/apache/spark/pull/46085#discussion_r1568115344 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala: ## @@ -46,15 +46,30 @@ class FilterPushdownSuite extends PlanTest { PushDownPredicates) :: Nil } + object PushExtraPredicateThroughJoinOptimize extends RuleExecutor[LogicalPlan] { Review Comment: In order not to affect the previous UT, add a new Optimize. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43861][CORE] Do not delete inprogress log [spark]
mridulm commented on PR #46025: URL: https://github.com/apache/spark/pull/46025#issuecomment-2060196960 Note that when driver crashes, the event file remains with `.inprogress` suffix. Not deleting these files would result in filling up the event directory - and eventually fail all jobs (depending on the fs limits for number of files per dir). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
itholic commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1568104641 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: Thanks @ueshin, will try -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
cloud-fan commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568103738 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR} + +/** + * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id` + * in join condition for better performance. + */ +object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( Review Comment: We can easily disable a QO rule by setting `spark.sql.optimizer.excludedRules` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47767][SQL] Show offset value in TakeOrderedAndProjectExec [spark]
guixiaowen commented on PR #45931: URL: https://github.com/apache/spark/pull/45931#issuecomment-2060168379 > Could you add one test case like `EXPLAIN ... LIMIT ... OFFSET ... ORDER BY ...` at https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/explain.sql ? @beliefer Thank you for you review. I just changed the prompt message. This will not affect the UT test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
zml1206 commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568084128 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR} + +/** + * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id` + * in join condition for better performance. + */ +object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( Review Comment: Sorry, I don’t understand what kind of flag it is. Can you give me an example? Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]
xinrong-meng closed pull request #46088: [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow URL: https://github.com/apache/spark/pull/46088 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]
xinrong-meng commented on PR #46088: URL: https://github.com/apache/spark/pull/46088#issuecomment-2060150462 Thank you all, merged to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
zml1206 commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568083598 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala: ## @@ -621,4 +621,14 @@ class DataFrameJoinSuite extends QueryTest checkAnswer(joined, Row("x", null, null)) checkAnswer(joined.filter($"new".isNull), Row("x", null, null)) } + + test("SPARK-47810: replace equivalent expression to <=> in join condition") { Review Comment: Added. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
HyukjinKwon commented on code in PR #46089: URL: https://github.com/apache/spark/pull/46089#discussion_r1568048732 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -0,0 +1,139 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +== +Python Data Source API +== + +.. currentmodule:: pyspark.sql + +Overview + +The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python. +This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources. + + +Creating a Python Data Source +- +To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data. + +This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment. + +**Step 1: Define the Data Source** + +Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows: + +.. code-block:: python + +from pyspark.sql.datasource import DataSource, DataSourceReader +from pyspark.sql.types import StructType + +class FakeDataSource(DataSource): +""" +A fake data source for PySpark to generate synthetic data using the `faker` library. +Options: +- numRows: specify number of rows to generate. Default value is 3. +""" + +@classmethod +def name(cls): +return "fake" + +def schema(self): +return "name string, date string, zipcode string, state string" + +def reader(self, schema: StructType): +return FakeDataSourceReader(schema, self.options) + + +**Step 2: Implement the Reader** + +Define the reader logic to generate synthetic data. Use the `faker` library to populate each field in the schema. + +.. code-block:: python + +class FakeDataSourceReader(DataSourceReader): + +def __init__(self, schema, options): +self.schema: StructType = schema +self.options = options + +def read(self, partition): +from faker import Faker +fake = Faker() +# Note: every value in this `self.options` dictionary is a string. +num_rows = int(self.options.get("numRows", 3)) +for _ in range(num_rows): +row = [] +for field in self.schema.fields: +value = getattr(fake, field.name)() +row.append(value) +yield tuple(row) + + +Using a Python Data Source Review Comment: Could we add a small section for `pip install` case too? I have an example at https://github.com/hyukjinkwon/pyspark-jira. We should probably say that this is not supported with Spark Connect client side (thus it has to be installed in Spark Connect server side). ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -0,0 +1,139 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +== +Python Data Source API +== + +.. currentmodule:: pyspark.sql + +Overview + +The Python
Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
HyukjinKwon commented on PR #46089: URL: https://github.com/apache/spark/pull/46089#issuecomment-2060146837 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]
xinrong-meng commented on PR #46088: URL: https://github.com/apache/spark/pull/46088#issuecomment-2060148294 Good idea! I'll file a separate PR @zhengruifeng thanks! Thanks @allisonwang-db I'll create tickets under the umbrella. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
HyukjinKwon closed pull request #46089: [SPARK-46375][DOCS] Add user guide for Python data source API URL: https://github.com/apache/spark/pull/46089 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP][SPARK-47763][CONNECT][TESTS] Enable local-cluster tests with pyspark-connect package [spark]
HyukjinKwon opened a new pull request, #46090: URL: https://github.com/apache/spark/pull/46090 ### What changes were proposed in this pull request? TBD ### Why are the changes needed? TBD ### Does this PR introduce _any_ user-facing change? TBD ### How was this patch tested? TBD ### Was this patch authored or co-authored using generative AI tooling? TBD -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47846][SQL] Add support for Variant type in from_json expression [spark]
harshmotw-db commented on PR #46046: URL: https://github.com/apache/spark/pull/46046#issuecomment-2060102585 @chenhao-db can you please look at this whenever you're free? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47418][SQL] Add hand-crafted implementations for lowercase uni… [spark]
HyukjinKwon commented on PR #46082: URL: https://github.com/apache/spark/pull/46082#issuecomment-2060101653 Mind making the PR title complete? It's truncated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47416][SQL] Add new functions to CollationBenchmark [spark]
HyukjinKwon commented on code in PR #46078: URL: https://github.com/apache/spark/pull/46078#discussion_r1568055547 ## sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala: ## @@ -100,6 +100,90 @@ abstract class CollationBenchmarkBase extends BenchmarkBase { ) benchmark.run() } + + def benchmarkUTFStringContains( + collationTypes: Seq[String], + utf8Strings: Seq[UTF8String]): Unit = { +val sublistStrings = utf8Strings + +val benchmark = new Benchmark( + "collation unit benchmarks - contains", + utf8Strings.size * 10, + warmupTime = 10.seconds, + output = output) +collationTypes.foreach(collationType => { Review Comment: can you use `foreach { collationType =>` instead -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
HyukjinKwon commented on PR #46045: URL: https://github.com/apache/spark/pull/46045#issuecomment-2060090501 I am fine with this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
HyukjinKwon commented on code in PR #46089: URL: https://github.com/apache/spark/pull/46089#discussion_r1568048989 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -0,0 +1,139 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +== +Python Data Source API +== + +.. currentmodule:: pyspark.sql + +Overview + +The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python. +This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources. + + +Creating a Python Data Source Review Comment: I would have a guide for static vs runtime cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
HyukjinKwon commented on code in PR #46089: URL: https://github.com/apache/spark/pull/46089#discussion_r1568048732 ## python/docs/source/user_guide/sql/python_data_source.rst: ## @@ -0,0 +1,139 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + +..http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. + +== +Python Data Source API +== + +.. currentmodule:: pyspark.sql + +Overview + +The Python Data Source API is a new feature introduced in Spark 4.0, enabling developers to read from custom data sources and write to custom data sinks in Python. +This guide provides a comprehensive overview of the API and instructions on how to create, use, and manage Python data sources. + + +Creating a Python Data Source +- +To create a custom Python data source, you'll need to subclass the :class:`DataSource` base classes and implement the necessary methods for reading and writing data. + +This example demonstrates creating a simple data source to generate synthetic data using the `faker` library. Ensure the `faker` library is installed and accessible in your Python environment. + +**Step 1: Define the Data Source** + +Start by creating a new subclass of :class:`DataSource`. Define the source name, schema, and reader logic as follows: + +.. code-block:: python + +from pyspark.sql.datasource import DataSource, DataSourceReader +from pyspark.sql.types import StructType + +class FakeDataSource(DataSource): +""" +A fake data source for PySpark to generate synthetic data using the `faker` library. +Options: +- numRows: specify number of rows to generate. Default value is 3. +""" + +@classmethod +def name(cls): +return "fake" + +def schema(self): +return "name string, date string, zipcode string, state string" + +def reader(self, schema: StructType): +return FakeDataSourceReader(schema, self.options) + + +**Step 2: Implement the Reader** + +Define the reader logic to generate synthetic data. Use the `faker` library to populate each field in the schema. + +.. code-block:: python + +class FakeDataSourceReader(DataSourceReader): + +def __init__(self, schema, options): +self.schema: StructType = schema +self.options = options + +def read(self, partition): +from faker import Faker +fake = Faker() +# Note: every value in this `self.options` dictionary is a string. +num_rows = int(self.options.get("numRows", 3)) +for _ in range(num_rows): +row = [] +for field in self.schema.fields: +value = getattr(fake, field.name)() +row.append(value) +yield tuple(row) + + +Using a Python Data Source Review Comment: Could we add a small section for `pip install` case too? I have an example at https://github.com/hyukjinkwon/pyspark-jira. We should probably say that this is not supported with Spark Connect client side (thus it has to be installed in Spark Connect server side). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]
HyukjinKwon closed pull request #46072: [SPARK-47877][SS][CONNECT] Speed up test_parity_listener URL: https://github.com/apache/spark/pull/46072 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]
HyukjinKwon commented on PR #46072: URL: https://github.com/apache/spark/pull/46072#issuecomment-2060083261 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun closed pull request #46087: [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` URL: https://github.com/apache/spark/pull/46087 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function doctests [spark]
HyukjinKwon closed pull request #46055: [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function doctests URL: https://github.com/apache/spark/pull/46055 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47760][SPARK-47763][CONNECT][TESTS] Reeanble Avro and Protobuf function doctests [spark]
HyukjinKwon commented on PR #46055: URL: https://github.com/apache/spark/pull/46055#issuecomment-2060081075 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47816][CONNECT][DOCS] Document the lazy evaluation of views in `spark.{sql, table}` [spark]
allisonwang-db commented on code in PR #46007: URL: https://github.com/apache/spark/pull/46007#discussion_r1568042050 ## python/pyspark/sql/session.py: ## @@ -1630,6 +1630,13 @@ def sql( --- :class:`DataFrame` +Notes +- +In Spark Classic, a temporary view referenced in `spark.sql` is resolved immediately, Review Comment: How about temp functions? ## python/pyspark/sql/session.py: ## @@ -1630,6 +1630,13 @@ def sql( --- :class:`DataFrame` +Notes +- +In Spark Classic, a temporary view referenced in `spark.sql` is resolved immediately, +while in Spark Connect it is lazily evaluated. Review Comment: I think this note might be very confusing to users, as data frames in Spark are all lazily evaluated, right? Maybe we can say "it is lazily analyzed". We should probably document this as a behavior change for Spark Connect. I am pretty sure there are other behavior changes. Also does this lazy analysis apply to persistent tables and views as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession [spark]
zhengruifeng closed pull request #46075: [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession URL: https://github.com/apache/spark/pull/46075 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47868][CONNECT] Fix recursion limit error in SparkConnectPlanner and SparkSession [spark]
zhengruifeng commented on PR #46075: URL: https://github.com/apache/spark/pull/46075#issuecomment-2060071663 merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-46375][DOCS] Add user guide for Python data source API [spark]
allisonwang-db opened a new pull request, #46089: URL: https://github.com/apache/spark/pull/46089 ### What changes were proposed in this pull request? This PR adds a new user guide for the Python data source API with a simple example. More examples (including streaming) will be added in the future. ### Why are the changes needed? To improve the documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? doctest ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]
zhengruifeng commented on PR #46088: URL: https://github.com/apache/spark/pull/46088#issuecomment-2060067749 the doc of `mapInArrow` is similar to `mapInPandas`, shall we refine the latter too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47618][CORE] Use `Magic Committer` for all S3 buckets by default [spark]
dongjoon-hyun commented on PR #45740: URL: https://github.com/apache/spark/pull/45740#issuecomment-2060044971 Thank you for your feedback, @steveloughran . Ya, as you mentioned, this is blocked by exactly those two configurations. ``` spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47877][SS][CONNECT] Speed up test_parity_listener [spark]
WweiL commented on PR #46072: URL: https://github.com/apache/spark/pull/46072#issuecomment-2060036826 @HyukjinKwon Can you take a look? Thank you! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
anton5798 commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568012332 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala: ## @@ -621,4 +621,14 @@ class DataFrameJoinSuite extends QueryTest checkAnswer(joined, Row("x", null, null)) checkAnswer(joined.filter($"new".isNull), Row("x", null, null)) } + + test("SPARK-47810: replace equivalent expression to <=> in join condition") { Review Comment: Let's also add a summary of plan change (before/after) and quantified performance benefit in the PR description -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
anton5798 commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568011225 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.{JOIN, OR} + +/** + * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id` + * in join condition for better performance. + */ +object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( Review Comment: Even though this is a simple rule, we should add a flag guard here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47810][SQL] Replace equivalent expression to <=> in join condition [spark]
anton5798 commented on code in PR #45999: URL: https://github.com/apache/spark/pull/45999#discussion_r1568005755 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJoinCondition.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions.{And, EqualNullSafe, EqualTo, IsNull, Or, PredicateHelper} +import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.catalyst.trees.TreePattern.JOIN + +/** + * Replaces `t1.id is null and t2.id is null or t1.id = t2.id` to `t1.id <=> t2.id` + * in join condition for better performance. + */ +object OptimizeJoinCondition extends Rule[LogicalPlan] with PredicateHelper { + override def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning( +_.containsPattern(JOIN), ruleId) { +case join @ Join(left, right, joinType, condition, hint) if condition.nonEmpty => + val predicates = condition.map(splitConjunctivePredicates).get + var replace = false + val newPredicates = predicates.map { +case Or(EqualTo(l, r), And(IsNull(c1), IsNull(c2))) Review Comment: If the target scenario just has `[l = r OR (isnull(l) and isnull(r))]` somewhere in the expression tree, then I think don't need to `splitConjunctivePredicates`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun commented on PR #46087: URL: https://github.com/apache/spark/pull/46087#issuecomment-2060008470 I removed the missed SPARK-46205 test case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46086: URL: https://github.com/apache/spark/pull/46086#discussion_r1567997561 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -20,18 +20,18 @@ package org.apache.spark.sql.hive.client import java.io.{File, PrintStream} import java.lang.reflect.InvocationTargetException import java.net.{URL, URLClassLoader} -import java.util +import java.util Review Comment: Thanks, now `./dev/scalastyle` should pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun commented on PR #46087: URL: https://github.com/apache/spark/pull/46087#issuecomment-2060007634 Yes, there are other commits about `compression` code and some neutral changes. I believe it will be okay and the final goal is to bring it back again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47876][PYTHON][DOCS] Improve docstring of mapInArrow [spark]
xinrong-meng opened a new pull request, #46088: URL: https://github.com/apache/spark/pull/46088 ### What changes were proposed in this pull request? Improve docstring of mapInArrow: - "using a Python native function that takes and outputs a PyArrow's RecordBatch" is confusing cause the function takes and outputs "ITERATOR of RecordBatchs" instead. - "All columns are passed together as an iterator of pyarrow.RecordBatchs" easily mislead users to think the entire DataFrame will be passed together, "a batch of rows" is used instead. ### Why are the changes needed? More accurate and clear docstring. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Doc change only. ### Was this patch authored or co-authored using generative AI tooling? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun commented on PR #46087: URL: https://github.com/apache/spark/pull/46087#issuecomment-2060004457 Thank you so much for swift help. I'll make it sure that all CIes passes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
dongjoon-hyun commented on code in PR #46086: URL: https://github.com/apache/spark/pull/46086#discussion_r1567994515 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala: ## @@ -20,18 +20,18 @@ package org.apache.spark.sql.hive.client import java.io.{File, PrintStream} import java.lang.reflect.InvocationTargetException import java.net.{URL, URLClassLoader} -import java.util +import java.util Review Comment: ? This change still looks weird. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
viirya commented on PR #46087: URL: https://github.com/apache/spark/pull/46087#issuecomment-2060002347 Pending CI. Thanks @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on code in PR #46086: URL: https://github.com/apache/spark/pull/46086#discussion_r1567992633 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -31,10 +32,10 @@ import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.DDL_TIME import org.apache.hadoop.hive.ql.metadata.HiveException import org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_FORMAT import org.apache.thrift.TException - Review Comment: Thanks, it is fixed now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun commented on PR #46087: URL: https://github.com/apache/spark/pull/46087#issuecomment-2059998699 Sorry but could you review this reverting PR, @viirya ? While I've running this, I found my mistake. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47875][CORE] Remove `spark.deploy.recoverySerializer` [spark]
dongjoon-hyun opened a new pull request, #46087: URL: https://github.com/apache/spark/pull/46087 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
dongjoon-hyun commented on code in PR #46086: URL: https://github.com/apache/spark/pull/46086#discussion_r1567987357 ## sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala: ## @@ -31,10 +32,10 @@ import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.DDL_TIME import org.apache.hadoop.hive.ql.metadata.HiveException import org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_FORMAT import org.apache.thrift.TException - Review Comment: ? Could you double-check your linter, @gengliangwang ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47838][BUILD] Upgrade `rocksdbjni` to 8.11.4 [spark]
neilramaswamy commented on PR #46065: URL: https://github.com/apache/spark/pull/46065#issuecomment-2059969162 @dongjoon-hyun numbers are still approximately the same (I just updated with the latest results), a few are better. Seems safe to merge when CI passes. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]
ericm-db commented on code in PR #45991: URL: https://github.com/apache/spark/pull/45991#discussion_r1567940592 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala: ## @@ -0,0 +1,265 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.streaming + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.execution.streaming.TransformWithStateKeyValueRowSchema.{COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL} +import org.apache.spark.sql.execution.streaming.state.{PrefixKeyScanStateEncoderSpec, StateStore, StateStoreErrors} +import org.apache.spark.sql.streaming.{MapState, TTLConfig} +import org.apache.spark.util.NextIterator + + +class MapStateImplWithTTL[K, V]( + store: StateStore, + stateName: String, + keyExprEnc: ExpressionEncoder[Any], + userKeyEnc: Encoder[K], + valEncoder: Encoder[V], + ttlConfig: TTLConfig, + batchTimestampMs: Long) extends CompositeKeyTTLStateImpl(stateName, store, batchTimestampMs) + with MapState[K, V] with Logging { + + private val keySerializer = keyExprEnc.createSerializer() + private val stateTypesEncoder = new CompositeKeyStateEncoder( +keySerializer, userKeyEnc, valEncoder, COMPOSITE_KEY_ROW_SCHEMA, stateName, hasTtl = true) + + private val ttlExpirationMs = +StateTTL.calculateExpirationTimeForDuration(ttlConfig.ttlDuration, batchTimestampMs) + + initialize() + + private def initialize(): Unit = { +store.createColFamilyIfAbsent(stateName, COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL, + PrefixKeyScanStateEncoderSpec(COMPOSITE_KEY_ROW_SCHEMA, 1)) + } + + /** Whether state exists or not. */ + override def exists(): Boolean = { +iterator().nonEmpty + } + + /** Get the state value if it exists */ + override def getValue(key: K): V = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key) +val retRow = store.get(encodedCompositeKey, stateName) + +if (retRow != null) { + val resState = stateTypesEncoder.decodeValue(retRow) + + if (!stateTypesEncoder.isExpired(retRow, batchTimestampMs)) { +resState + } else { +null.asInstanceOf[V] + } +} else { + null.asInstanceOf[V] +} + } + + /** Check if the user key is contained in the map */ + override def containsKey(key: K): Boolean = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +getValue(key) != null + } + + /** Update value for given user key */ + override def updateValue(key: K, value: V): Unit = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +StateStoreErrors.requireNonNullStateValue(value, stateName) +val encodedValue = stateTypesEncoder.encodeValue(value, ttlExpirationMs) +val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key) +store.put(encodedCompositeKey, encodedValue, stateName) +val serializedGroupingKey = stateTypesEncoder.serializeGroupingKey() +val serializedUserKey = stateTypesEncoder.serializeUserKey(key) +upsertTTLForStateKey(ttlExpirationMs, serializedGroupingKey, serializedUserKey) + } + + /** Get the map associated with grouping key */ + override def iterator(): Iterator[(K, V)] = { +val encodedGroupingKey = stateTypesEncoder.encodeGroupingKey() +val unsafeRowPairIterator = store.prefixScan(encodedGroupingKey, stateName) +new NextIterator[(K, V)] { + override protected def getNext(): (K, V) = { +val iter = unsafeRowPairIterator.dropWhile { rowPair => + stateTypesEncoder.isExpired(rowPair.value, batchTimestampMs) +} +if (iter.hasNext) { + val currentRowPair = iter.next() + val key = stateTypesEncoder.decodeCompositeKey(currentRowPair.key) + val value = stateTypesEncoder.decodeValue(currentRowPair.value) + (key, value) +} else { + finished = true + null.asInstanceOf[(K, V)] +} + } + + override protected def close(): Unit = {} +} + } + +
Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]
gengliangwang commented on PR #45923: URL: https://github.com/apache/spark/pull/45923#issuecomment-2059908947 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework [spark]
gengliangwang closed pull request #45923: [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework URL: https://github.com/apache/spark/pull/45923 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SHUFFLE] [WIP] Prototype: store shuffle file on external storage like S3 [spark]
steveloughran commented on PR #34864: URL: https://github.com/apache/spark/pull/34864#issuecomment-2059907838 @michaelbilow hadoop s3a is on v2 sdk; the com.amazonaws classes are not on the CP and amazon are slowly stopping support. you cannot for example use the lower latency S3 express stores with it. Like I say: I think you would be better off using the Hue file system APIs to talk to s3. If there are aspects of s3 storage which aren't available through the API -or just very inefficiently due to the effort to preserve the Posix metaphor, then lets fix the API so that other stores can offer the same features, and other apps can pick up. For example, here's our ongoing delete API for iceberg and other manifest-based tables https://github.com/apache/hadoop/pull/6726 It maps to s3 bulk delete calls, but there's scope to add to other stores (we now actually want to add it as a page-size == 1 option for all filesystems as it simplifies iceberg integration). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47618][CORE] Use `Magic Committer` for all S3 buckets by default [spark]
steveloughran commented on PR #45740: URL: https://github.com/apache/spark/pull/45740#issuecomment-2059899891 I have no problems with the PR; we have made it the default in our releases. This could be a good time to revisit "why there's some separate PathOutputCommitter" stuff; originally it was because spark built against releases without the new PathOutputCommitter interface. This no longer holds: could anything needed from it be pulled up into the main committer? One recurrent troublespot we have with committing work is parquet; it requires all committers to be a subclass of ParquetOutputCommitter, hence the (ugly, brittle) wrapping stuff. Life will be a lot easier if parquet didn't mind if it was any PathOutputCommitter -it would just skip the schema writing. Of course, we then come up against the fact that parquet still wants to build against hadoop 2.8. Everyone needs to move on, especially as hadoop java11+ support is 3.2.x+ only. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution [spark]
xupefei commented on PR #45748: URL: https://github.com/apache/spark/pull/45748#issuecomment-2059899197 > @xupefei could you provide more details in the PR description? For example, what is the difference with/without `WITH SCHEMA EVOLUTION` Hi @gengliangwang, I added to the PR description as you advised. Please have a look! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47805][SS] Implementing TTL for MapState [spark]
ericm-db commented on code in PR #45991: URL: https://github.com/apache/spark/pull/45991#discussion_r1567908966 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala: ## @@ -0,0 +1,265 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.streaming + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.execution.streaming.TransformWithStateKeyValueRowSchema.{COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL} +import org.apache.spark.sql.execution.streaming.state.{PrefixKeyScanStateEncoderSpec, StateStore, StateStoreErrors} +import org.apache.spark.sql.streaming.{MapState, TTLConfig} +import org.apache.spark.util.NextIterator + + +class MapStateImplWithTTL[K, V]( + store: StateStore, + stateName: String, + keyExprEnc: ExpressionEncoder[Any], + userKeyEnc: Encoder[K], + valEncoder: Encoder[V], + ttlConfig: TTLConfig, + batchTimestampMs: Long) extends CompositeKeyTTLStateImpl(stateName, store, batchTimestampMs) + with MapState[K, V] with Logging { + + private val keySerializer = keyExprEnc.createSerializer() + private val stateTypesEncoder = new CompositeKeyStateEncoder( +keySerializer, userKeyEnc, valEncoder, COMPOSITE_KEY_ROW_SCHEMA, stateName, hasTtl = true) + + private val ttlExpirationMs = +StateTTL.calculateExpirationTimeForDuration(ttlConfig.ttlDuration, batchTimestampMs) + + initialize() + + private def initialize(): Unit = { +store.createColFamilyIfAbsent(stateName, COMPOSITE_KEY_ROW_SCHEMA, VALUE_ROW_SCHEMA_WITH_TTL, + PrefixKeyScanStateEncoderSpec(COMPOSITE_KEY_ROW_SCHEMA, 1)) + } + + /** Whether state exists or not. */ + override def exists(): Boolean = { +iterator().nonEmpty + } + + /** Get the state value if it exists */ + override def getValue(key: K): V = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key) +val retRow = store.get(encodedCompositeKey, stateName) + +if (retRow != null) { + val resState = stateTypesEncoder.decodeValue(retRow) + + if (!stateTypesEncoder.isExpired(retRow, batchTimestampMs)) { +resState + } else { +null.asInstanceOf[V] + } +} else { + null.asInstanceOf[V] +} + } + + /** Check if the user key is contained in the map */ + override def containsKey(key: K): Boolean = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +getValue(key) != null + } + + /** Update value for given user key */ + override def updateValue(key: K, value: V): Unit = { +StateStoreErrors.requireNonNullStateValue(key, stateName) +StateStoreErrors.requireNonNullStateValue(value, stateName) +val encodedValue = stateTypesEncoder.encodeValue(value, ttlExpirationMs) +val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key) +store.put(encodedCompositeKey, encodedValue, stateName) +val serializedGroupingKey = stateTypesEncoder.serializeGroupingKey() +val serializedUserKey = stateTypesEncoder.serializeUserKey(key) +upsertTTLForStateKey(ttlExpirationMs, serializedGroupingKey, serializedUserKey) + } + + /** Get the map associated with grouping key */ + override def iterator(): Iterator[(K, V)] = { +val encodedGroupingKey = stateTypesEncoder.encodeGroupingKey() +val unsafeRowPairIterator = store.prefixScan(encodedGroupingKey, stateName) +new NextIterator[(K, V)] { + override protected def getNext(): (K, V) = { +val iter = unsafeRowPairIterator.dropWhile { rowPair => + stateTypesEncoder.isExpired(rowPair.value, batchTimestampMs) +} +if (iter.hasNext) { + val currentRowPair = iter.next() + val key = stateTypesEncoder.decodeCompositeKey(currentRowPair.key) + val value = stateTypesEncoder.decodeValue(currentRowPair.value) + (key, value) +} else { + finished = true + null.asInstanceOf[(K, V)] +} + } + + override protected def close(): Unit = {} +} + } + +
Re: [PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46086: URL: https://github.com/apache/spark/pull/46086#issuecomment-2059874634 cc @panbingkun @itholic -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang opened a new pull request, #46086: URL: https://github.com/apache/spark/pull/46086 ### What changes were proposed in this pull request? Migrate logInfo in Hive module with variables to structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang closed pull request #46022: [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework URL: https://github.com/apache/spark/pull/46022 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework [spark]
gengliangwang commented on PR #46022: URL: https://github.com/apache/spark/pull/46022#issuecomment-2059867235 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47858][SPARK-47852][PYTHON][SQL] Refactoring the structure for DataFrame error context [spark]
ueshin commented on code in PR #46063: URL: https://github.com/apache/spark/pull/46063#discussion_r1567889537 ## sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/QueryContexts.scala: ## @@ -160,6 +160,8 @@ case class DataFrameQueryContext( val pysparkFragment: String = pysparkErrorContext.map(_._1).getOrElse("") val pysparkCallSite: String = pysparkErrorContext.map(_._2).getOrElse("") + PySparkCurrentOrigin.clear() Review Comment: If we clear `PySparkCurrentOrigin` right after `get`, seems like `lit` here: https://github.com/apache/spark/blob/ccb87fb26481acecbd75682ddc8fa9e5dd52e8cb/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L171 consumes the PySpark context in this case. We may want to clear it after we call the JVM method. ```py _capture_call_site(func.__name__) try: return func(*args, **kwargs) finally: pyspark_origin.clean() ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47838][BUILD] Upgrade rocksdbjni to 8.11.4 [spark]
neilramaswamy commented on PR #46065: URL: https://github.com/apache/spark/pull/46065#issuecomment-2059736653 The JDK 21 results are _slightly_ slower than what they were before, which is odd since nothing really changed between these released. So I'll confirm if this is just variance by running again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
CTCC1 commented on code in PR #46045: URL: https://github.com/apache/spark/pull/46045#discussion_r1566772319 ## python/pyspark/sql/connect/functions/builtin.py: ## @@ -2476,8 +2476,26 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", int]) -> Column: repeat.__doc__ = pysparkfuncs.repeat.__doc__ -def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column: -return _invoke_function("split", _to_col(str), lit(pattern), lit(limit)) +def split( +str: "ColumnOrName", +pattern: Union[Column, str], +limit: Union["ColumnOrName", int] = -1, +) -> Column: +# work around shadowing of str in the input variable name +from builtins import str as py_str + +if isinstance(pattern, py_str): +_pattern = lit(pattern) +elif isinstance(pattern, Column): +_pattern = pattern +else: +raise PySparkTypeError( +error_class="NOT_COLUMN_OR_STR", +message_parameters={"arg_name": "pattern", "arg_type": type(pattern).__name__}, +) + +limit = lit(limit) if isinstance(limit, int) else _to_col(limit) +return _invoke_function("split", _to_col(str), _pattern, limit) Review Comment: Thanks for the suggestion, this is simpler for sure! The only concern is that we will not raise `PySparkTypeError` if `pattern` is passed in for a type other than `Column` or `str`, and it will form a UnresolvedFunction. Is raising such error early a requirement for connect? For now I simplified the code but kept the error raise. Lmk if we want to skip raising the error. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]
CTCC1 commented on code in PR #46045: URL: https://github.com/apache/spark/pull/46045#discussion_r1567800324 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala: ## @@ -4197,6 +4197,20 @@ object functions { */ def split(str: Column, pattern: String): Column = Column.fn("split", str, lit(pattern)) + /** + * Splits str around matches of the given pattern. + * + * @param str + * a string expression to split + * @param pattern + * a column of string representing a regular expression. The regex string should be a Java + * regular expression. + * + * @group string_funcs + * @since 4.0.0 + */ + def split(str: Column, pattern: Column): Column = Column.fn("split", str, pattern) Review Comment: thanks for the pointer, fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47839][SQL] Fix aggregate bug in RewriteWithExpression [spark]
kelvinjian-db commented on code in PR #46034: URL: https://github.com/apache/spark/pull/46034#discussion_r1567767274 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteWithExpressionSuite.scala: ## @@ -29,25 +29,29 @@ import org.apache.spark.sql.types.IntegerType class RewriteWithExpressionSuite extends PlanTest { object Optimizer extends RuleExecutor[LogicalPlan] { -val batches = Batch("Rewrite With expression", Once, RewriteWithExpression) :: Nil +val batches = Batch("Rewrite With expression", Once, + PullOutGroupingExpressions, + RewriteWithExpression) :: Nil } private val testRelation = LocalRelation($"a".int, $"b".int) private val testRelation2 = LocalRelation($"x".int, $"y".int) test("simple common expression") { val a = testRelation.output.head -val commonExprDef = CommonExpressionDef(a) -val ref = new CommonExpressionRef(commonExprDef) -val plan = testRelation.select(With(ref + ref, Seq(commonExprDef)).as("col")) +val expr = With.create((a, 0)) { case Seq(ref) => Review Comment: i ended up changing the tests to manually extract the common expression IDs so we don't need `With.create` anymore -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution [spark]
gengliangwang commented on PR #45748: URL: https://github.com/apache/spark/pull/45748#issuecomment-2059645090 @xupefei could you provide more details in the PR description? For example, what is the difference with/without `WITH SCHEMA EVOLUTION` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-47745] Add License to Spark Operator repository [spark-kubernetes-operator]
viirya commented on PR #3: URL: https://github.com/apache/spark-kubernetes-operator/pull/3#issuecomment-2059585966 Thank you @jiangzho @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org