[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
LuciferYang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031155887 ## project/SparkBuild.scala: ## @@ -607,9 +607,25 @@ object SparkParallelTestGrouping { object Core { import scala.sys.process.Process + import BuildCommons.protoVersion def buildenv = Process(Seq("uname")).!!.trim.replaceFirst("[^A-Za-z0-9].*", "").toLowerCase def bashpath = Process(Seq("where", "bash")).!!.split("[\r\n]+").head.replace('\\', '/') lazy val settings = Seq( +// Setting version for the protobuf compiler. This has to be propagated to every sub-project +// even if the project is not using it. +PB.protocVersion := BuildCommons.protoVersion, +// For some reason the resolution from the imported Maven build does not work for some +// of these dependendencies that we need to shade later on. +libraryDependencies ++= { + Seq( +"io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(), +"com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf" + ) +}, +(Compile / PB.targets) := Seq( + PB.gens.java -> (Compile / sourceManaged).value, Review Comment: Also need `assembly` and `rename` for protobuf? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1031150953 ## python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py: ## @@ -165,100 +148,191 @@ def merge_pandas(lft, _): ) def test_apply_in_pandas_not_returning_pandas_dataframe(self): -left = self.data1 -right = self.data2 +self._test_merge_error( +fn=lambda lft, rgt: lft.size + rgt.size, +error_class=PythonException, +error_message_regex="Return type of the user-defined function " +"should be pandas.DataFrame, but is ", +) + +def test_apply_in_pandas_returning_column_names(self): +self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", "k"])) +def test_apply_in_pandas_returning_no_column_names(self): def merge_pandas(lft, rgt): -return lft.size + rgt.size +res = pd.merge(lft, rgt, on=["id", "k"]) +res.columns = range(res.columns.size) +return res -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Return type of the user-defined function should be pandas.DataFrame, " -"but is ", -): -( -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge(fn=merge_pandas) -def test_apply_in_pandas_returning_wrong_number_of_columns(self): -left = self.data1 -right = self.data2 +def test_apply_in_pandas_returning_column_names_sometimes(self): +def merge_pandas(lft, rgt): +res = pd.merge(lft, rgt, on=["id", "k"]) +if 0 in lft["id"] and lft["id"][0] % 2 == 0: +return res +res.columns = range(res.columns.size) +return res + +self._test_merge(fn=merge_pandas) +def test_apply_in_pandas_returning_wrong_column_names(self): def merge_pandas(lft, rgt): if 0 in lft["id"] and lft["id"][0] % 2 == 0: lft["add"] = 0 if 0 in rgt["id"] and rgt["id"][0] % 3 == 0: rgt["more"] = 1 return pd.merge(lft, rgt, on=["id", "k"]) -with QuietTest(self.sc): -with self.assertRaisesRegex( -PythonException, -"Number of columns of the returned pandas.DataFrame " -"doesn't match 
specified schema. Expected: 4 Actual: 6", -): -( -# merge_pandas returns two columns for even keys while we set schema to four -left.groupby("id") -.cogroup(right.groupby("id")) -.applyInPandas(merge_pandas, "id long, k int, v int, v2 int") -.collect() -) +self._test_merge_error( +fn=merge_pandas, +error_class=PythonException, +error_message_regex="Column names of the returned pandas.DataFrame " +"do not match specified schema. Unexpected: add, more Schema: id, k, v, v2\n", +) -def test_apply_in_pandas_returning_empty_dataframe(self): -left = self.data1 -right = self.data2 +# with very large schema, missing and unexpected is limited to 5 +# and the schema is abbreviated in the error message +schema = "id long, k long, mean double, " + ", ".join( +f"column_with_long_column_name_{no} integer" for no in range(35) +) +self._test_merge_error( +fn=lambda lft, rgt: pd.DataFrame( +[ +( +lft.id, +lft.k, +lft.v.mean(), +) ++ tuple(lft.v.mean() for _ in range(7)) +], +columns=["id", "k", "mean"] + [f"extra_column_{no} integer" for no in range(7)], +), +output_schema=schema, +error_class=PythonException, +error_message_regex="Column names of the returned pandas\\.DataFrame " +"do not match specified schema\\. " +"Missing \\(first 5 of 35\\): column_with_long_column_name_0," +" column_with_long_column_name_1, column_with_long_column_name_10," +" column_with_long_column_name_11, column_with_long_column_name_12 " +"Unexpected \\(first 5 of 7\\): extra_column_0 integer, extra_column_1 integer," +" extra_column_2 integer, extra_column_3 integer, extra_column_4 integer " +"Schema: id, k, mean, column_with_long_column_name_0, column_with_long_column_name_1," +"
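The refactor in the diff above replaces repeated `QuietTest`/`assertRaisesRegex` boilerplate with shared `_test_merge` / `_test_merge_error` helpers. A hypothetical, self-contained sketch of that helper pattern (class names, helper signature, and sample data are illustrative, not the actual Spark test suite):

```python
import unittest


class MergeTestHelpers:
    """Mixin collecting the repeated error-assertion boilerplate in one place."""

    def _run_merge(self, fn):
        # Stand-in for groupby().cogroup().applyInPandas(...).collect();
        # here we just invoke the function on two dummy groups.
        return fn([1, 2], [3, 4])

    def _test_merge_error(self, fn, error_class, error_message_regex):
        # Assert that running fn raises error_class with a matching message.
        with self.assertRaisesRegex(error_class, error_message_regex):
            self._run_merge(fn)


class ExampleMergeTest(MergeTestHelpers, unittest.TestCase):
    def test_not_returning_dataframe(self):
        def bad_merge(lft, rgt):
            raise TypeError(
                "Return type of the user-defined function should be "
                "pandas.DataFrame, but is <class 'int'>")

        self._test_merge_error(
            fn=bad_merge,
            error_class=TypeError,
            error_message_regex="should be pandas\\.DataFrame",
        )
```

Each new test case then reduces to one `self._test_merge_error(...)` call, as in the diff.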
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1031150574 ## python/pyspark/worker.py: ## @@ -146,7 +146,74 @@ def verify_result_type(result): ) -def wrap_cogrouped_map_pandas_udf(f, return_type, argspec): +def verify_pandas_result(result, return_type, assign_cols_by_name): +import pandas as pd + +if not isinstance(result, pd.DataFrame): +raise TypeError( +"Return type of the user-defined function should be " +"pandas.DataFrame, but is {}".format(type(result)) +) + +# check the schema of the result only if it is not empty or has columns +if not result.empty or len(result.columns) != 0: +# if any column name of the result is a string +# the column names of the result have to match the return type +# see create_array in pyspark.sql.pandas.serializers.ArrowStreamPandasSerializer +field_names = set([field.name for field in return_type.fields]) +column_names = set(result.columns) +if ( +assign_cols_by_name +and any(isinstance(name, str) for name in result.columns) +and column_names != field_names +): +limit = 5 Review Comment: @HyukjinKwon I have removed the schema and columns limit.
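The `verify_pandas_result` check quoted above only compares column names when the result is non-empty (or has columns) and at least one column name is a string. A simplified standalone sketch of that logic (not the actual `pyspark.worker` implementation; names and message wording are illustrative):

```python
import pandas as pd


def verify_columns(result, field_names, assign_cols_by_name=True):
    """Simplified sketch of the column-name check for applyInPandas results."""
    if not isinstance(result, pd.DataFrame):
        raise TypeError(
            "Return type of the user-defined function should be "
            "pandas.DataFrame, but is {}".format(type(result)))
    # Skip the check for a truly empty result with no columns.
    if result.empty and len(result.columns) == 0:
        return
    # Only compare by name when at least one column name is a string;
    # all-integer column names mean positional matching.
    if assign_cols_by_name and any(isinstance(c, str) for c in result.columns):
        missing = sorted(str(c) for c in set(field_names) - set(result.columns))
        unexpected = sorted(str(c) for c in set(result.columns) - set(field_names))
        if missing or unexpected:
            raise ValueError(
                "Column names of the returned pandas.DataFrame do not "
                "match specified schema. Missing: {} Unexpected: {}".format(
                    ", ".join(missing), ", ".join(unexpected)))
```

A result with an extra column then fails by name, while a result with default integer column names passes and is matched positionally.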
[GitHub] [spark] HyukjinKwon commented on pull request #38787: [SPARK-41251][PS][INFRA] Upgrade pandas from 1.5.1 to 1.5.2
HyukjinKwon commented on PR #38787: URL: https://github.com/apache/spark/pull/38787#issuecomment-1326072793 cc @Yikun @itholic @xinrong-meng FYI
[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
LuciferYang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031148805 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins +maven-shade-plugin + + false + true + + + com.google.protobuf:* + + + + + com.google.protobuf + ${spark.shade.packageName}.spark-core.protobuf + +com.google.protobuf.** + + + + + + +com.github.os72 +protoc-jar-maven-plugin +3.11.4 + + + +generate-test-sources Review Comment: `generate-test-sources`?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
LuciferYang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031147004 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins Review Comment: SPARK-40593 and SPARK-41215 did some work to support compiling Spark 3.4 on CentOS 6 and 7. I think the core module needs similar work, but that may not belong in this PR.
[GitHub] [spark] zhengruifeng commented on pull request #38774: [SPARK-41240][CONNECT][BUILD][INFRA] Upgrade `Protobuf` to 3.19.5
zhengruifeng commented on PR #38774: URL: https://github.com/apache/spark/pull/38774#issuecomment-1326069519 we may also need to change this place https://github.com/apache/spark/blob/master/connector/connect/src/main/buf.gen.yaml#L30 cc @grundprinzip
[GitHub] [spark] panbingkun opened a new pull request, #38787: [SPARK-41251][PS][INFRA] Upgrade pandas from 1.5.1 to 1.5.2
panbingkun opened a new pull request, #38787: URL: https://github.com/apache/spark/pull/38787 ### What changes were proposed in this pull request? This PR proposes upgrading pandas to 1.5.2 for the pandas API on Spark. The new version of pandas (1.5.2) was released on Nov 22, 2022. Release Notes: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.5.2.html ### Why are the changes needed? We should follow the behavior of the latest pandas and support it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA.
[GitHub] [spark] zhengruifeng commented on pull request #38770: [SPARK-41238][CONNECT][PYTHON] Support more built-in datatypes
zhengruifeng commented on PR #38770: URL: https://github.com/apache/spark/pull/38770#issuecomment-1326058481 > Can you test both nullable=true and nullable=false case? the tests added in `test_schema` cover those cases
[GitHub] [spark] wilfred-s commented on pull request #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs
wilfred-s commented on PR #38780: URL: https://github.com/apache/spark/pull/38780#issuecomment-1326057909 already did that directly after it was logged
[GitHub] [spark] LuciferYang commented on a diff in pull request #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH
LuciferYang commented on code in PR #38782: URL: https://github.com/apache/spark/pull/38782#discussion_r1031134476 ## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala: ## @@ -637,6 +637,33 @@ class QueryExecutionErrorsSuite sqlState = "0A000") } + test("FAILED_RENAME_PATH: rename when destination path already exists") { +var srcPath: Path = null +var dstPath: Path = null + +val e = intercept[SparkFileAlreadyExistsException]( + withTempPath { p => +val basePath = new Path(p.getAbsolutePath) +val fm = new FileSystemBasedCheckpointFileManager(basePath, new Configuration()) +srcPath = new Path(s"$basePath/src") +fm.createAtomic(srcPath, overwriteIfPossible = true).close() +dstPath = new Path(s"$basePath/dst") +fm.createAtomic(dstPath, overwriteIfPossible = true).close() + +fm.renameTempFile(srcPath, dstPath, false) Review Comment: Could we add this case through a user-facing API?
[GitHub] [spark] LuciferYang commented on pull request #38598: [SPARK-41097][CORE][SQL][SS][PROTOBUF] Remove redundant collection conversion base on Scala 2.13 code
LuciferYang commented on PR #38598: URL: https://github.com/apache/spark/pull/38598#issuecomment-1326049324 ready to merge
[GitHub] [spark] itholic commented on pull request #38664: [SPARK-41147][SQL] Assign a name to the legacy error class `_LEGACY_ERROR_TEMP_1042`
itholic commented on PR #38664: URL: https://github.com/apache/spark/pull/38664#issuecomment-1326044896 Thanks for the review, @MaxGekk Just addressed the comments!
[GitHub] [spark] gengliangwang commented on pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector
gengliangwang commented on PR #38783: URL: https://github.com/apache/spark/pull/38783#issuecomment-1326027832 > shall we also shade protobuf in Spark Core in this PR? Then the proto version is unified in all the places. In this PR there is no actual usage. I would prefer to shade it in https://github.com/apache/spark/pull/38779
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type
zhengruifeng commented on code in PR #38768: URL: https://github.com/apache/spark/pull/38768#discussion_r1031102422 ## python/pyspark/sql/connect/dataframe.py: ## @@ -51,24 +52,11 @@ class GroupingFrame(object): Review Comment: not related to this PR, but shall we rename it `GroupedData` to match pyspark? ## python/pyspark/sql/connect/dataframe.py: ## @@ -164,8 +156,20 @@ def selectExpr(self, *expr: Union[str, List[str]]) -> "DataFrame": return DataFrame.withPlan(plan.Project(self._plan, *sql_expr), session=self._session) -def agg(self, exprs: Optional[GroupingFrame.MeasuresType]) -> "DataFrame": -return self.groupBy().agg(exprs) +def agg(self, *exprs: Union[Expression, Dict[str, str]]) -> "DataFrame": +if not exprs: +raise ValueError("Argument 'exprs' must not be empty") + +if len(exprs) == 1 and isinstance(exprs[0], dict): +measures = [] +for e, fun in exprs[0].items(): +measures.append(ScalarFunctionExpression(fun, Column(e))) Review Comment: ```suggestion measures = [ScalarFunctionExpression(f, Column(e)) for e, f in exprs[0].items()] ```
[GitHub] [spark] cloud-fan commented on pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector
cloud-fan commented on PR #38783: URL: https://github.com/apache/spark/pull/38783#issuecomment-1326008658 shall we also shade protobuf in Spark Core in this PR?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector
cloud-fan commented on code in PR #38783: URL: https://github.com/apache/spark/pull/38783#discussion_r1031094917 ## pom.xml: ## @@ -118,7 +118,10 @@ 2.19.0 3.3.4 -2.5.0 + + 2.5.0 + Review Comment: ```suggestion ```
[GitHub] [spark] MaxGekk closed pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
MaxGekk closed pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092 URL: https://github.com/apache/spark/pull/38710
[GitHub] [spark] MaxGekk commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
MaxGekk commented on PR #38710: URL: https://github.com/apache/spark/pull/38710#issuecomment-1326004383 +1, LGTM. Merging to master. Thank you, @panbingkun.
[GitHub] [spark] amaliujia commented on pull request #38786: [SPARK-41250][CONNECT][PYTHON] DataFrame.to_pandas should not return optional pandas dataframe
amaliujia commented on PR #38786: URL: https://github.com/apache/spark/pull/38786#issuecomment-1326002119 @zhengruifeng @HyukjinKwon cc @xinrong-meng @grundprinzip
[GitHub] [spark] MaxGekk commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
MaxGekk commented on code in PR #38769: URL: https://github.com/apache/spark/pull/38769#discussion_r1031089197 ## core/src/main/resources/error/error-classes.json: ## @@ -785,6 +779,13 @@ "Malformed Protobuf messages are detected in message deserialization. Parse Mode: . To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'." ] }, + "MISSING_AGGREGATION" : { +"message" : [ + "The non-aggregating expression is based on columns which are not participating in the GROUP BY clause.", + "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value()\" if you do not care which of the values within a group is returned." Review Comment: > "anyvalue(c2)" .. "any_value(c2)" hmm, could you open a PR and override `prettyName` in `AnyValue`, and pass `prettyName` to `FirstLast.validateIgnoreNullExpr()`. Also, cc @vitaliili-db
[GitHub] [spark] amaliujia opened a new pull request, #38786: [SPARK-41250][CONNECT][PYTHON] DataFrame.to_pandas should not return optional pandas dataframe
amaliujia opened a new pull request, #38786: URL: https://github.com/apache/spark/pull/38786 ### What changes were proposed in this pull request? The server guarantees to send at least one arrow batch with the schema even when the result is empty. In this case, `DataFrame.to_pandas` can always return a Pandas DataFrame. This PR decouples the client-side execution path for `Command` and `Relation` to remove `Optional` from the return type of `DataFrame.to_pandas`. ### Why are the changes needed? API coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? UT
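The guarantee described above means the client can always materialize a (possibly empty) DataFrame from the schema-bearing batch instead of returning `None`. A hypothetical illustration of the idea using plain pandas (the Arrow transport is elided; function and variable names are illustrative, not the Spark Connect client API):

```python
import pandas as pd


def to_pandas_from_batches(schema_columns, batches):
    # Even with zero data batches, the schema lets us build a well-typed
    # empty DataFrame, so the return type never needs to be Optional.
    if not batches:
        return pd.DataFrame(columns=schema_columns)
    return pd.concat(batches, ignore_index=True)


# An empty result still yields a DataFrame with the right columns.
empty = to_pandas_from_batches(["id", "name"], [])
print(list(empty.columns), len(empty))  # ['id', 'name'] 0
```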
[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
itholic commented on code in PR #38769: URL: https://github.com/apache/spark/pull/38769#discussion_r1031083666 ## core/src/main/resources/error/error-classes.json: ## @@ -785,6 +779,13 @@ "Malformed Protobuf messages are detected in message deserialization. Parse Mode: . To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'." ] }, + "MISSING_AGGREGATION" : { +"message" : [ + "The non-aggregating expression is based on columns which are not participating in the GROUP BY clause.", + "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value()\" if you do not care which of the values within a group is returned." Review Comment: I just tried your suggestion, and I checked the first one is also affected, so it looks a bit weird IMO. For example, the error message will be shown like: ``` ... The non-aggregating expression "anyvalue(c2)" is based on columns ... ... aggregate the expression, or use "any_value(c2)" if you do not care ... ``` 1. Maybe we should separate the parameters into 2 parts something like and ?? ```scala new AnalysisException( errorClass = "MISSING_AGGREGATION", messageParameters = Map( "expression" -> toSQLExpr(expression), "expressionAnyValue" -> toSQLExpr(new AnyValue(expression))) ``` 2. Or maybe we can simply recommend to use the `any_value()` function as below: ``` ... aggregate the expression, or use `any_value()` if you do not care ... ``` something like that?
[GitHub] [spark] itholic commented on pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
itholic commented on PR #38769: URL: https://github.com/apache/spark/pull/38769#issuecomment-1325994942 Fixed the Python test first.
[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
itholic commented on code in PR #38769: URL: https://github.com/apache/spark/pull/38769#discussion_r1031085119 ## core/src/main/resources/error/error-classes.json: ## @@ -785,6 +779,13 @@ "Malformed Protobuf messages are detected in message deserialization. Parse Mode: . To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'." ] }, + "MISSING_AGGREGATION" : { +"message" : [ + "The non-aggregating expression is based on columns which are not participating in the GROUP BY clause.", + "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value()\" if you do not care which of the values within a group is returned." Review Comment: Or maybe we can simply recommend using the `any_value()` function as below: ``` ... aggregate the expression, or use `any_value()` for if you do not care ... ``` something like that?
[GitHub] [spark] MaxGekk closed pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042
MaxGekk closed pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042 URL: https://github.com/apache/spark/pull/38707
[GitHub] [spark] MaxGekk commented on pull request #38707: [SPARK-41176][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1042
MaxGekk commented on PR #38707: URL: https://github.com/apache/spark/pull/38707#issuecomment-1325992375 +1, LGTM. Merging to master. Thank you, @panbingkun and @srielau for the review.
[GitHub] [spark] MaxGekk commented on pull request #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH
MaxGekk commented on PR #38782: URL: https://github.com/apache/spark/pull/38782#issuecomment-1325989625 cc @panbingkun @LuciferYang @itholic
[GitHub] [spark] MaxGekk commented on pull request #38766: [MINOR][SQL] Fix error message for `UNEXPECTED_INPUT_TYPE`
MaxGekk commented on PR #38766: URL: https://github.com/apache/spark/pull/38766#issuecomment-1325988492 +1, LGTM. Merged to master. Thank you, @itholic.
[GitHub] [spark] MaxGekk closed pull request #38766: [MINOR][SQL] Fix error message for `UNEXPECTED_INPUT_TYPE`
MaxGekk closed pull request #38766: [MINOR][SQL] Fix error message for `UNEXPECTED_INPUT_TYPE` URL: https://github.com/apache/spark/pull/38766
[GitHub] [spark] MaxGekk commented on a diff in pull request #38784: [SPARK-41248][SQL] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646
MaxGekk commented on code in PR #38784: URL: https://github.com/apache/spark/pull/38784#discussion_r1031078969 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3629,6 +3629,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val JSON_ENABLE_PARTIAL_RESULTS = +buildConf("spark.sql.json.enablePartialResults") + .internal() + .doc("When set to true, enables partial results for structs, maps, and arrays in JSON " + +"when one or more fields do not match the schema") + .version("3.4.0") + .booleanConf + .createWithDefault(false) Review Comment: Could you re-gen results of `JsonBenchmark`. If there are no significant diffs, I am ok to enable it. Also, it would be nice to add one more benchmark for this particular case of partial results.
[GitHub] [spark] MaxGekk commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`
MaxGekk commented on PR #38772: URL: https://github.com/apache/spark/pull/38772#issuecomment-1325981880 > Maybe do we want to consolidate them in one rule ? I agree, let's consolidate. I would prefer `DATATYPE` since it is shorter ;-)
[GitHub] [spark] itholic commented on a diff in pull request #38769: [SPARK-41228][SQL] Rename & Improve error message for `COLUMN_NOT_IN_GROUP_BY_CLAUSE`.
itholic commented on code in PR #38769: URL: https://github.com/apache/spark/pull/38769#discussion_r1031072613 ## core/src/main/resources/error/error-classes.json: ## @@ -785,6 +779,13 @@ "Malformed Protobuf messages are detected in message deserialization. Parse Mode: . To process malformed protobuf message as null result, try setting the option 'mode' as 'PERMISSIVE'." ] }, + "MISSING_AGGREGATION" : { +"message" : [ + "The non-aggregating expression is based on columns which are not participating in the GROUP BY clause.", + "Add the columns or the expression to the GROUP BY, aggregate the expression, or use \"any_value()\" if you do not care which of the values within a group is returned." Review Comment: Cool. Let me update soon.
[GitHub] [spark] itholic commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`
itholic commented on PR #38772: URL: https://github.com/apache/spark/pull/38772#issuecomment-1325973768 Btw, we use both `DATA_TYPE` and `DATATYPE` for error class names. Maybe we should consolidate them into one rule?
[GitHub] [spark] HyukjinKwon closed pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon closed pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation URL: https://github.com/apache/spark/pull/38659
[GitHub] [spark] HyukjinKwon commented on pull request #38659: [SPARK-41114][CONNECT] Support local data for LocalRelation
HyukjinKwon commented on PR #38659: URL: https://github.com/apache/spark/pull/38659#issuecomment-1325972726 Merged to master.
[GitHub] [spark] itholic commented on pull request #38772: [SPARK-41237][SQL] Assign a name to the error class `_LEGACY_ERROR_TEMP_0030`
itholic commented on PR #38772: URL: https://github.com/apache/spark/pull/38772#issuecomment-1325972686 > Can't you re-use the existing error class `UNSUPPORTED_DATATYPE`? Oh... I missed that one. Sounds good!
[GitHub] [spark] HeartSaVioR commented on pull request #38785: [SPARK-41249][SS] Add acceptance test for self-union on streaming query
HeartSaVioR commented on PR #38785: URL: https://github.com/apache/spark/pull/38785#issuecomment-1325972411 cc. @zsxwing @viirya Please take a look. Thanks!
[GitHub] [spark] HeartSaVioR opened a new pull request, #38785: [SPARK-41249][SS] Add acceptance test for self-union on streaming query
HeartSaVioR opened a new pull request, #38785: URL: https://github.com/apache/spark/pull/38785 ### What changes were proposed in this pull request? This PR proposes to add a new test suite specifically for self-union tests on streaming queries. The test cases are acceptance tests for 4 different cases: DSv1 vs DSv2, and DataStreamReader API vs table API. ### Why are the changes needed? This PR brings more test coverage on streaming workloads. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test suite.
[GitHub] [spark] ulysses-you commented on a diff in pull request #38739: [SPARK-41207][SQL] Fix BinaryArithmetic with negative scale
ulysses-you commented on code in PR #38739: URL: https://github.com/apache/spark/pull/38739#discussion_r1031054191 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala: ## @@ -276,9 +276,9 @@ class DecimalPrecisionSuite extends AnalysisTest with BeforeAndAfter { val a = AttributeReference("a", DecimalType(3, -10))() val b = AttributeReference("b", DecimalType(1, -1))() val c = AttributeReference("c", DecimalType(35, 1))() - checkType(Multiply(a, b), DecimalType(5, -11)) - checkType(Multiply(a, c), DecimalType(38, -9)) - checkType(Multiply(b, c), DecimalType(37, 0)) + checkType(Multiply(a, b), DecimalType(16, 0)) Review Comment: Not sure what you found. I think the reason why Divide and IntegralDivide fail is simple. The SQL standard does not allow negative scale, but we use its definition formula to calculate the result precision and scale. Then the result precision can be negative, which is unexpected. So I think the other binary arithmetic operations also should not follow the formula if the scale is negative.
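The formula being discussed can be sketched numerically. This is a simplified illustration of the SQL-standard result type for decimal multiplication (result precision = p1 + p2 + 1 capped at the maximum of 38, result scale = s1 + s2), ignoring Spark's scale adjustment on precision overflow; it reproduces the previous expected types from the test quoted above:

```python
MAX_PRECISION = 38  # Spark's DecimalType maximum precision

def multiply_result_type(p1, s1, p2, s2):
    # SQL-standard formula for decimal multiplication:
    # scale = s1 + s2, precision = p1 + p2 + 1 (capped at MAX_PRECISION).
    # A negative input scale flows straight into the result type.
    return (min(p1 + p2 + 1, MAX_PRECISION), s1 + s2)

# With the negative-scale attributes from the test:
# DecimalType(3, -10) * DecimalType(1, -1)  -> DecimalType(5, -11)
# DecimalType(3, -10) * DecimalType(35, 1)  -> DecimalType(38, -9)
# DecimalType(1, -1)  * DecimalType(35, 1)  -> DecimalType(37, 0)
```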
[GitHub] [spark] sadikovi commented on a diff in pull request #38784: [SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646
sadikovi commented on code in PR #38784: URL: https://github.com/apache/spark/pull/38784#discussion_r1031065321 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3629,6 +3629,15 @@ object SQLConf { .booleanConf .createWithDefault(true) + val JSON_ENABLE_PARTIAL_RESULTS = +buildConf("spark.sql.json.enablePartialResults") + .internal() + .doc("When set to true, enables partial results for structs, maps, and arrays in JSON " + +"when one or more fields do not match the schema") + .version("3.4.0") + .booleanConf + .createWithDefault(false) Review Comment: I am still debating whether to keep this as `true` or `false`. On one hand, when enabled, it fixes the correctness issue of partially parsing JSON records. When disabled, it could cause performance issues (impact is yet to be confirmed). @MaxGekk I would love to know your thoughts on this. I would prefer to keep it enabled until the benchmark is produced that shows the regression.
[GitHub] [spark] sadikovi opened a new pull request, #38784: [SPARK-41248] Add "spark.sql.json.enablePartialResults" to enable/disable JSON partial results parsing added in SPARK-40646
sadikovi opened a new pull request, #38784: URL: https://github.com/apache/spark/pull/38784 ### What changes were proposed in this pull request? This PR adds a SQL config `spark.sql.json.enablePartialResults` to control SPARK-40646 change. This allows us to fall back to the behaviour before the change. It was observed that SPARK-40646 could cause a performance regression for deeply nested schemas. I, however, could not reproduce the regression with OSS benchmarks (maybe we need to extend them, I can do it as a follow-up). Regardless, I propose to add a SQL config to have an ability to disable the change in case of performance degradation during JSON parsing. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? SQL config `spark.sql.json.enablePartialResults` is added to control the behaviour of SPARK-40646 JSON partial results parsing. Users can disable the feature if they find any performance regressions when reading JSON files. ### How was this patch tested? I extended existing unit tests to test with flag enabled and disabled.
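The semantics the new flag toggles can be illustrated conceptually. The sketch below is not Spark's implementation, just a minimal model of "partial results": only the mismatching field is nulled out, instead of the whole record.

```python
import json

def parse_partial(record, schema):
    # Keep the fields that match the expected type; null out only the
    # mismatches, rather than returning null for the entire row.
    raw = json.loads(record)
    return {
        field: (raw.get(field) if isinstance(raw.get(field), expected) else None)
        for field, expected in schema.items()
    }

# "a" parses cleanly; only the mistyped "b" becomes null.
row = parse_partial('{"a": 1, "b": "oops"}', {"a": int, "b": int})
```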
[GitHub] [spark] gengliangwang commented on pull request #38779: [WIP][SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
gengliangwang commented on PR #38779: URL: https://github.com/apache/spark/pull/38779#issuecomment-1325963483 Pending on https://github.com/apache/spark/pull/38783
[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
gengliangwang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031060603 ## core/pom.xml: ## @@ -532,7 +533,12 @@ org.apache.commons commons-crypto - + + com.google.protobuf Review Comment: Created https://github.com/apache/spark/pull/38783
[GitHub] [spark] gengliangwang opened a new pull request, #38783: [SPARK-41247][BUILD] Unify the Protobuf versions in Spark connect and Protobuf connector
gengliangwang opened a new pull request, #38783: URL: https://github.com/apache/spark/pull/38783 ### What changes were proposed in this pull request? Unify the Protobuf versions in Spark connect and Protobuf connector. ### Why are the changes needed? The Protobuf dependencies will have the same behavior. And future upgrades will be easier. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests
[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
gengliangwang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031058129 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins +maven-shade-plugin + + false + true + + + com.google.protobuf:* + + + + + com.google.protobuf + ${spark.shade.packageName}.spark-core.protobuf + +com.google.protobuf.** + + + + + + +com.github.os72 +protoc-jar-maven-plugin +3.11.4 Review Comment: The plugin is already the latest: https://mvnrepository.com/artifact/com.github.os72/protoc-jar-maven-plugin But yes, we can unify the versions with the one in the protobuf connector.
[GitHub] [spark] gengliangwang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
gengliangwang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031056472 ## core/pom.xml: ## @@ -532,7 +533,12 @@ org.apache.commons commons-crypto - + + com.google.protobuf Review Comment: This is for the core module. Do you mean put the changes of the other pom files into an individual PR?
[GitHub] [spark] cloud-fan commented on pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on PR #38750: URL: https://github.com/apache/spark/pull/38750#issuecomment-1325955535 cc @MaxGekk @gengliangwang
[GitHub] [spark] cloud-fan commented on pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on PR #38750: URL: https://github.com/apache/spark/pull/38750#issuecomment-1325955422 Nice refactor! We should have done this earlier, before adding ANSI interval types and timestamp NTZ. Now we should have more confidence in these new data types.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031053828 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.types + +import org.apache.spark.sql.types._ + +sealed abstract class PhysicalDataType {} + +case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) +extends PhysicalDataType {} + +case class PhysicalBinaryType() extends PhysicalDataType {} + +case class PhysicalBooleanType() extends PhysicalDataType {} + +case class PhysicalByteType() extends PhysicalDataType {} + +case class PhysicalCalendarIntervalType() extends PhysicalDataType {} + +case class PhysicalDecimalType(precision: Int, scale: Int) extends PhysicalDataType {} + +case class PhysicalDoubleType() extends PhysicalDataType {} + +case class PhysicalFloatType() extends PhysicalDataType {} + +case class PhysicalIntegerType() extends PhysicalDataType {} + +case class PhysicalLongType() extends PhysicalDataType {} + +case class PhysicalMapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean) +extends PhysicalDataType {} + +case class PhysicalNullType() extends PhysicalDataType {} + +case class PhysicalObjectType(cls: Class[_]) extends PhysicalDataType {} Review Comment: do we really need this? I feel it's like `UserDefinedType` and we should never get its physical type.
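For intuition, the point of the physical-type layer is that several logical types can share one in-memory representation (for example, dates are stored as Int days since the epoch and timestamps as Long microseconds). A rough sketch below, where the mapping table is an assumption for illustration, not Spark's actual resolution code:

```python
# Illustrative mapping from logical (user-facing) types to physical
# (in-memory) types, so operators only need to handle the physical type.
# The entries are assumptions based on Spark's documented storage formats.
LOGICAL_TO_PHYSICAL = {
    "IntegerType": "PhysicalIntegerType",
    "DateType": "PhysicalIntegerType",      # days since epoch stored as Int
    "LongType": "PhysicalLongType",
    "TimestampType": "PhysicalLongType",    # microseconds since epoch as Long
    "DecimalType(p, s)": "PhysicalDecimalType",
}

def physical_type(logical):
    # Distinct logical types can resolve to the same physical type.
    return LOGICAL_TO_PHYSICAL[logical]
```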
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031054150 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.types + +import org.apache.spark.sql.types._ + +sealed abstract class PhysicalDataType {} + +case class PhysicalArrayType(elementType: DataType, containsNull: Boolean) +extends PhysicalDataType {} + +case class PhysicalBinaryType() extends PhysicalDataType {} Review Comment: should they be scala `object`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031053475 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala: ## @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.types + +import org.apache.spark.sql.types._ + +sealed abstract class PhysicalDataType {} Review Comment: nit: in scala we can omit `{}` if it's empty. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031052882 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala: ## @@ -1901,24 +1904,27 @@ object CodeGenerator extends Logging { * Returns the Java type for a DataType. */ def javaType(dt: DataType): String = dt match { -case BooleanType => JAVA_BOOLEAN -case ByteType => JAVA_BYTE -case ShortType => JAVA_SHORT -case IntegerType | DateType | _: YearMonthIntervalType => JAVA_INT -case LongType | TimestampType | TimestampNTZType | _: DayTimeIntervalType => JAVA_LONG -case FloatType => JAVA_FLOAT -case DoubleType => JAVA_DOUBLE -case _: DecimalType => "Decimal" -case BinaryType => "byte[]" -case StringType => "UTF8String" -case CalendarIntervalType => "CalendarInterval" -case _: StructType => "InternalRow" -case _: ArrayType => "ArrayData" -case _: MapType => "MapData" case udt: UserDefinedType[_] => javaType(udt.sqlType) -case ObjectType(cls) if cls.isArray => s"${javaType(ObjectType(cls.getComponentType))}[]" -case ObjectType(cls) => cls.getName -case _ => "Object" +case _ => dt.physicalDataType match { Review Comment: nit: ``` dt.physicalDataType match { ... } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031051704 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala: ## @@ -253,13 +255,16 @@ object RowEncoder { } case _: DayTimeIntervalType => ObjectType(classOf[java.time.Duration]) case _: YearMonthIntervalType => ObjectType(classOf[java.time.Period]) -case _: DecimalType => ObjectType(classOf[java.math.BigDecimal]) -case StringType => ObjectType(classOf[java.lang.String]) -case _: ArrayType => ObjectType(classOf[scala.collection.Seq[_]]) -case _: MapType => ObjectType(classOf[scala.collection.Map[_, _]]) -case _: StructType => ObjectType(classOf[Row]) case p: PythonUserDefinedType => externalDataTypeFor(p.sqlType) case udt: UserDefinedType[_] => ObjectType(udt.userClass) +case _ => dt.physicalDataType match { + case _: PhysicalArrayType => ObjectType(classOf[scala.collection.Seq[_]]) + case _: PhysicalDecimalType => ObjectType(classOf[java.math.BigDecimal]) + case _: PhysicalMapType => ObjectType(classOf[scala.collection.Map[_, _]]) + case _: PhysicalStringType => ObjectType(classOf[java.lang.String]) + case _: PhysicalStructType => ObjectType(classOf[Row]) + case _ => dt Review Comment: ditto, add some comments ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala: ## @@ -358,6 +363,7 @@ object RowEncoder { If(IsNull(input), Literal.create(null, externalDataTypeFor(input.dataType)), CreateExternalRow(convertedFields, schema)) +case _ => input Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031051302 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala: ## @@ -214,6 +215,7 @@ object RowEncoder { } else { nonNullOutput } +case _ => inputObject Review Comment: can we add a comment? `For other data types, just return the internal catalyst value as it is.` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38750: [SPARK-41226][SQL] Refactor Spark types by introducing physical types
cloud-fan commented on code in PR #38750: URL: https://github.com/apache/spark/pull/38750#discussion_r1031050991 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala: ## @@ -129,24 +130,25 @@ object InternalRow { */ def getAccessor(dt: DataType, nullable: Boolean = true): (SpecializedGetters, Int) => Any = { val getValueNullSafe: (SpecializedGetters, Int) => Any = dt match { Review Comment: instead of ``` dt match { case _ => dt.physicalDataType match ... } ``` we can just do ``` dt.physicalDataType match ... ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ibuder opened a new pull request, #38782: [SPARK-38728][SQL] Test the error class: FAILED_RENAME_PATH
ibuder opened a new pull request, #38782: URL: https://github.com/apache/spark/pull/38782 ### What changes were proposed in this pull request? This adds a test for error class FAILED_RENAME_PATH in QueryExecutionErrorsSuite. ### Why are the changes needed? @MaxGekk this addresses https://issues.apache.org/jira/browse/SPARK-38728 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The only change is the addition of the FAILED_RENAME_PATH test. ### License This contribution is my original work. I license the work to the project under the project’s open source license. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)
cloud-fan commented on code in PR #38760: URL: https://github.com/apache/spark/pull/38760#discussion_r1031049024 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable { // have overflowed our Long; in either case we must rescale dv to the new scale. dv = dv.setScale(scale, roundMode) if (dv.precision > precision) { Review Comment: I think this makes sense. If we write `SELECT 0bd` in Spark, the returned decimal is also `decimal(1, 0)`. Maybe forbidding `decimal(0, 0)` is a better choice. What do you think @srielau ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Yikun commented on pull request #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs
Yikun commented on PR #38780: URL: https://github.com/apache/spark/pull/38780#issuecomment-1325947902 @wilfred-s Would you mind setting up your GitHub Actions according to the note at https://github.com/apache/spark/pull/38780/checks?check_run_id=9680278988? cc @dongjoon-hyun @yangwwei
[GitHub] [spark] yorksity opened a new pull request, #38781: [SPARK-41246][core] Solve the problem of RddId negative
yorksity opened a new pull request, #38781: URL: https://github.com/apache/spark/pull/38781 ### What changes were proposed in this pull request? ### Why are the changes needed? Solves the problem of negative RDD ids that occurs in long-running tasks, such as streaming tasks. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
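For context on how an RDD id can go negative (assuming, as the PR title and description suggest, that the monotonically increasing id counter eventually wraps in a long-running job): a Java `int` counter silently overflows to `Integer.MIN_VALUE` once it passes `Integer.MAX_VALUE`. A minimal illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RddIdOverflow {
    public static void main(String[] args) {
        // Simulates a long-running job that has already consumed
        // Integer.MAX_VALUE ids; the next increment wraps to a negative value.
        AtomicInteger nextRddId = new AtomicInteger(Integer.MAX_VALUE);
        int id = nextRddId.incrementAndGet();
        System.out.println(id); // prints "-2147483648"
    }
}
```

Whether Spark should wrap, widen, or fail on overflow is exactly the design question such a fix needs to answer.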
[GitHub] [spark] cloud-fan commented on pull request #38640: [WIP][SPARK-41124][SQL][TEST] Add DSv2 PlanStabilitySuites
cloud-fan commented on PR #38640: URL: https://github.com/apache/spark/pull/38640#issuecomment-1325947062 > Actually, I'm happy to work on making parquet v2 tables available in a separate ticket/PR if you can give me some guidance. I tried to do it a long time ago but failed, as there are some design issues. We need to fully understand the use cases of `CREATE TABLE ... USING v1Source` and see how to make it work for v2 sources: 1. Just a name mapping, so that people can use a table name instead of providing all the data source information every time. The JDBC data source is a good example of this. 2. Schema cache. The data source may incur a large cost to infer the data schema and needs the Spark catalog to cache it. File sources are a good example here. We also need to think about the semantics of `ALTER TABLE`, `REFRESH TABLE`, `df.write.mode("overwrite").saveAsTable`, etc. Some code references: for v1 sources, we have a rule `FindDataSourceTable` to resolve tables with a v1 source. For v2 sources, we probably should have a similar rule to resolve the v2 source to a `TableProvider`.
[GitHub] [spark] ulysses-you commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)
ulysses-you commented on code in PR #38760: URL: https://github.com/apache/spark/pull/38760#discussion_r1031044398 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable { // have overflowed our Long; in either case we must rescale dv to the new scale. dv = dv.setScale(scale, roundMode) if (dv.precision > precision) { Review Comment: It's 1; see the docs in `java.math.BigDecimal` (screenshot: https://user-images.githubusercontent.com/12025282/203693808-37715386-1fda-4134-aafc-b558ee55b7fe.png)
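The answer above can be checked directly against `java.math.BigDecimal`, whose Javadoc states that the precision of a zero value is 1, regardless of scale. That is why the `dv.precision > precision` check in `changePrecision` always trips for a `decimal(0, 0)` target, even for the value zero. A quick check:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ZeroPrecision {
    public static void main(String[] args) {
        // Per the BigDecimal Javadoc, the precision of a zero value is 1.
        BigDecimal zero = new BigDecimal("0");
        System.out.println(zero.precision()); // prints "1"

        // Rescaling to scale 0 does not change that, so a decimal(0, 0)
        // target (precision 0) always looks like an overflow.
        BigDecimal rescaled = zero.setScale(0, RoundingMode.HALF_UP);
        System.out.println(rescaled.precision() > 0); // prints "true"
    }
}
```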
[GitHub] [spark] amaliujia commented on pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type
amaliujia commented on PR #38768: URL: https://github.com/apache/spark/pull/38768#issuecomment-1325935616 @zhengruifeng @grundprinzip can you take another look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wilfred-s opened a new pull request, #38780: [SPARK-41185][K8S][DOCS] Remove ARM limitation for YuniKorn from docs
wilfred-s opened a new pull request, #38780: URL: https://github.com/apache/spark/pull/38780 ### What changes were proposed in this pull request? Remove the limitations section from the K8s documentation for YuniKorn. ### Why are the changes needed? The limitation section is not correct. YuniKorn is fully supported from release 1.1.0 onwards. YuniKorn 1.1.0 is the release that is referenced in the documentation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
cloud-fan commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031036283 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins +maven-shade-plugin + + false + true + + + com.google.protobuf:* + + + + + com.google.protobuf + ${spark.shade.packageName}.spark-core.protobuf + +com.google.protobuf.** + + + + + + +com.github.os72 +protoc-jar-maven-plugin +3.11.4 Review Comment: This is shaded, I don't think the version matters to hadoop2? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)
cloud-fan commented on code in PR #38760: URL: https://github.com/apache/spark/pull/38760#discussion_r1031035938 ## sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala: ## @@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable { // have overflowed our Long; in either case we must rescale dv to the new scale. dv = dv.setScale(scale, roundMode) if (dv.precision > precision) { Review Comment: what is the precision of `BigDecimal` in this case? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38760: [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0)
cloud-fan commented on code in PR #38760: URL: https://github.com/apache/spark/pull/38760#discussion_r1031035738 ## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala: ## @@ -3537,6 +3537,12 @@ class DataFrameSuite extends QueryTest }.isEmpty) } } + + test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") { +val df = Seq("0.5944910").toDF("a") +checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0)) +checkAnswer(df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))"), Row(0)) Review Comment: Now `cast(a as decimal(7,7)) div 100` fails and we want to fix it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #38768: [SPARK-41230][CONNECT][PYTHON] Remove `str` from Aggregate expression type
amaliujia commented on code in PR #38768: URL: https://github.com/apache/spark/pull/38768#discussion_r1031035663 ## python/pyspark/sql/connect/plan.py: ## @@ -558,29 +557,19 @@ def _repr_html_(self) -> str: class Aggregate(LogicalPlan): -MeasureType = Tuple["ExpressionOrString", str] -MeasuresType = Sequence[MeasureType] -OptMeasuresType = Optional[MeasuresType] - def __init__( self, child: Optional["LogicalPlan"], grouping_cols: List[Column], -measures: OptMeasuresType, Review Comment: This is a sequence now and I think it can be len=0 which is empty measures? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
LuciferYang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031033790 ## project/SparkBuild.scala: ## @@ -607,9 +607,25 @@ object SparkParallelTestGrouping { object Core { import scala.sys.process.Process + import BuildCommons.protoVersion def buildenv = Process(Seq("uname")).!!.trim.replaceFirst("[^A-Za-z0-9].*", "").toLowerCase def bashpath = Process(Seq("where", "bash")).!!.split("[\r\n]+").head.replace('\\', '/') lazy val settings = Seq( +// Setting version for the protobuf compiler. This has to be propagated to every sub-project +// even if the project is not using it. +PB.protocVersion := BuildCommons.protoVersion, +// For some reason the resolution from the imported Maven build does not work for some +// of these dependendencies that we need to shade later on. +libraryDependencies ++= { + Seq( +"io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(), Review Comment: Is grpc plugin required? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #38761: [SPARK-40988][SQL][TEST] Test case for insert partition should verify value
ulysses-you commented on PR #38761: URL: https://github.com/apache/spark/pull/38761#issuecomment-1325911082 thank you @rangareddy, it seems something went wrong with the GitHub action. Can you rebase your branch to retry?
[GitHub] [spark] LuciferYang commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
LuciferYang commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031027267 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins +maven-shade-plugin + + false + true + + + com.google.protobuf:* + + + + + com.google.protobuf + ${spark.shade.packageName}.spark-core.protobuf + +com.google.protobuf.** + + + + + + +com.github.os72 +protoc-jar-maven-plugin +3.11.4 Review Comment: We need to check the impact on hadoop-2: although Spark 3.4 will not release a hadoop-2 distribution, CI still runs it.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
cloud-fan commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031025029 ## core/pom.xml: ## @@ -616,6 +621,50 @@ + +org.apache.maven.plugins +maven-shade-plugin + + false + true + + + com.google.protobuf:* + + + + + com.google.protobuf + ${spark.shade.packageName}.spark-core.protobuf + +com.google.protobuf.** + + + + + + +com.github.os72 +protoc-jar-maven-plugin +3.11.4 Review Comment: shall we pick `3.21.9` as it was used in the protobuf module? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38779: [SPARK-41244][UI] Introducing a Protobuf serializer for UI data on KV store
cloud-fan commented on code in PR #38779: URL: https://github.com/apache/spark/pull/38779#discussion_r1031025176 ## core/pom.xml: ## @@ -532,7 +533,12 @@ org.apache.commons commons-crypto - + + com.google.protobuf Review Comment: Can we have individual PR for this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1031023071 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { + // Get the schema + Schema schema = 1; + + // Is local + IsLocal is_local = 2; + + // Is Streaming + IsStreaming is_streaming = 3; + + // Get the explain string of the plan. + Explain explain = 4; + + // Get the tree string of the schema. + TreeString tree_string = 5; + + // Get the input files. + InputFiles input_files = 6; + + // Get the semantic hash + SemanticHash semantic_hash = 7; + + // Check whether plans are equal. Review Comment: will remove semantic_hash and same_semantics since they are developer apis, although they were also in pyspark ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { + // Get the schema + Schema schema = 1; + + // Is local + IsLocal is_local = 2; + + // Is Streaming + IsStreaming is_streaming = 3; + + // Get the explain string of the plan. + Explain explain = 4; + + // Get the tree string of the schema. + TreeString tree_string = 5; + + // Get the input files. + InputFiles input_files = 6; + + // Get the semantic hash + SemanticHash semantic_hash = 7; Review Comment: will remove it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1031022749 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { + // Get the schema + Schema schema = 1; + + // Is local + IsLocal is_local = 2; + + // Is Streaming + IsStreaming is_streaming = 3; + + // Get the explain string of the plan. + Explain explain = 4; + + // Get the tree string of the schema. + TreeString tree_string = 5; + + // Get the input files. + InputFiles input_files = 6; + + // Get the semantic hash + SemanticHash semantic_hash = 7; + + // Check whether plans are equal. + SameSemantics same_semantics = 8; +} + } + + // Analyze the input plan and return the schema. + message Schema { } + + // Returns true if the `collect` and `take` methods can be run locally. + message IsLocal { } + + // Returns true if this Dataset contains one or more sources that continuously + // return data as it arrives. + message IsStreaming { } + + // Explains the input plan based on a configurable mode. + message Explain { +// Plan explanation mode. +enum ExplainMode { + MODE_UNSPECIFIED = 0; + + // Generates only physical plan. + SIMPLE = 1; + + // Generates parsed logical plan, analyzed logical plan, optimized logical plan and physical plan. + // Parsed Logical plan is a unresolved plan that extracted from the query. Analyzed logical plans + // transforms which translates unresolvedAttribute and unresolvedRelation into fully typed objects. + // The optimized logical plan transforms through a set of optimization rules, resulting in the + // physical plan. + EXTENDED = 2; + + // Generates code for the statement, if any and a physical plan. 
+ CODEGEN = 3; + + // If plan node statistics are available, generates a logical plan and also the statistics. + COST = 4; + + // Generates a physical plan outline and also node details. + FORMATTED = 5; +} + +// (Required) For analyzePlan rpc calls, configure the mode to explain plan in strings. +ExplainMode explain_mode= 1; + } + + // Generate a string to express the schema in a nice tree format. + // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema') + // to compute the results. + message TreeString { + +// (Optional) The level to generate the string. +optional int32 level = 1; + } + + // Returns a best-effort snapshot of the files that compose this Dataset. + // It will invoke 'Dataset.inputFiles' to compute the results. + message InputFiles { } Review Comment: it had some usages anyway -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
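The `ExplainMode` enum in the quoted proto fixes a wire value for each explain mode. As a quick reference, here is a minimal Python sketch of that mapping; the numeric values come from the diff above, but the generated Spark Connect classes are not assumed, so `ExplainMode` and `explain_mode_from_string` below are illustrative stand-ins:

```python
from enum import IntEnum

class ExplainMode(IntEnum):
    """Wire values mirroring the ExplainMode enum in the quoted proto diff."""
    MODE_UNSPECIFIED = 0
    SIMPLE = 1      # physical plan only
    EXTENDED = 2    # parsed, analyzed, optimized logical plans + physical plan
    CODEGEN = 3     # generated code (if any) + physical plan
    COST = 4        # logical plan with statistics, when available
    FORMATTED = 5   # physical plan outline + node details

def explain_mode_from_string(mode: str) -> ExplainMode:
    """Resolve a user-facing mode string (e.g. 'extended') to its wire value."""
    try:
        return ExplainMode[mode.upper()]
    except KeyError:
        raise ValueError(f"unknown explain mode: {mode!r}")
```

A client could use such a helper to translate the string accepted by `DataFrame.explain(mode=...)` into the `explain_mode` field of the request.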
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1031021916 ## python/pyspark/sql/connect/dataframe.py: ## @@ -797,6 +796,137 @@ def schema(self) -> StructType: else: return self._schema +@property +def isLocal(self) -> bool: +"""Returns ``True`` if the :func:`collect` and :func:`take` methods can be run locally +(without any Spark executors). + +.. versionadded:: 3.4.0 + +Returns +--- +bool +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +return self._session.is_local(query) + +@property +def isStreaming(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` contains one or more sources that +continuously return data as it arrives. A :class:`DataFrame` that reads data from a +streaming source must be executed as a :class:`StreamingQuery` using the :func:`start` +method in :class:`DataStreamWriter`. Methods that return a single answer, (e.g., +:func:`count` or :func:`collect`) will throw an :class:`AnalysisException` when there +is a streaming source present. + +.. versionadded:: 3.4.0 + +Notes +- +This API is evolving. + +Returns +--- +bool +Whether it's streaming DataFrame or not. +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +return self._session.is_streaming(query) + +def printSchema(self) -> None: +"""Prints out the schema in the tree format. + +.. versionadded:: 3.4.0 + +Returns +--- +None +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +print(self._session.tree_string(query)) + +def semanticHash(self) -> int: Review Comment: oh, I did not notice that, I am fine to remove `sameSemantics` and `semanticHash` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
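The new `DataFrame` members in the diff above all share one shape: guard against an empty plan, serialize the plan, and delegate to a session-side analysis call. A stand-alone sketch of that pattern follows; the `_plan`/`_session` attribute names echo the diff, but `StubSession` and the string "plan" stand-in for the real proto serialization are invented for illustration:

```python
from typing import Optional

class StubSession:
    """Invented stand-in for the Spark Connect session's analysis RPCs."""
    def is_local(self, query: str) -> bool:
        return True
    def is_streaming(self, query: str) -> bool:
        return False

class ConnectDataFrame:
    def __init__(self, plan: Optional[str], session: StubSession):
        self._plan = plan
        self._session = session

    def _to_query(self) -> str:
        # Mirror the guard used by isLocal/isStreaming/printSchema in the diff.
        if self._plan is None:
            raise Exception("Cannot analyze on empty plan.")
        return self._plan  # stands in for self._plan.to_proto(self._session)

    @property
    def isLocal(self) -> bool:
        return self._session.is_local(self._to_query())

    @property
    def isStreaming(self) -> bool:
        return self._session.is_streaming(self._to_query())
```

Factoring the guard into one helper (as `_to_query` does here) avoids repeating the empty-plan check in every property, which is the duplication visible in the quoted diff.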
[GitHub] [spark] LuciferYang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
LuciferYang commented on PR #38743: URL: https://github.com/apache/spark/pull/38743#issuecomment-1325892309 Congratulations @WolverineJiang
[GitHub] [spark] WolverineJiang commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
WolverineJiang commented on PR #38743: URL: https://github.com/apache/spark/pull/38743#issuecomment-1325891951 Thanks @HyukjinKwon @LuciferYang @AmplabJenkins~
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38778: [SPARK-41227][CONNECT][PYTHON] Implement `DataFrame.crossJoin`
zhengruifeng commented on code in PR #38778: URL: https://github.com/apache/spark/pull/38778#discussion_r1031016264 ## python/pyspark/sql/tests/connect/test_connect_plan_only.py: ## @@ -58,6 +58,12 @@ def test_join_condition(self): )._plan.to_proto(self.connect) self.assertIsNotNone(plan.root.join.join_condition) +def test_crossjoin(self): +left_input = self.connect.readTable(table_name=self.tbl_name) +right_input = self.connect.readTable(table_name=self.tbl_name) +plan = left_input.crossJoin(other=right_input)._plan.to_proto(self.connect) +self.assertEqual(plan.root.join.join_type, 7) # JOIN_TYPE_CROSS + Review Comment: after this PR, `DataFrame.join` also supports `cross`; I think we can also add a test for it
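The test in the diff above pins `JOIN_TYPE_CROSS` to wire value 7. A small sketch of the client-side mapping that test exercises; only the `cross` value is confirmed by the quoted snippet, so this hypothetical helper deliberately handles nothing else:

```python
# Wire value confirmed by the quoted test: JOIN_TYPE_CROSS = 7.
JOIN_TYPE_CROSS = 7

def join_type_for(how: str) -> int:
    """Map a user-facing join string to its proto wire value.

    Only 'cross' is confirmed by the quoted test, so that is all this
    sketch covers; other join types would need their own enum values.
    """
    if how.replace("_", "").lower() == "cross":
        return JOIN_TYPE_CROSS
    raise NotImplementedError(f"join type {how!r} not covered by this sketch")
```

The follow-up test the reviewer suggests would assert that `join(..., how="cross")` produces the same `join_type` value as `crossJoin`.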
[GitHub] [spark] HyukjinKwon closed pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
HyukjinKwon closed pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf. URL: https://github.com/apache/spark/pull/38743
[GitHub] [spark] HyukjinKwon commented on pull request #38743: [SPARK-41215][BUILD][PROTOBUF] Support user configurable protoc executables when building Spark Protobuf.
HyukjinKwon commented on PR #38743: URL: https://github.com/apache/spark/pull/38743#issuecomment-1325885697 Merged to master.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/sem
cloud-fan commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1031014545 ## python/pyspark/sql/connect/dataframe.py: ## @@ -797,6 +796,137 @@ def schema(self) -> StructType: else: return self._schema +@property +def isLocal(self) -> bool: +"""Returns ``True`` if the :func:`collect` and :func:`take` methods can be run locally +(without any Spark executors). + +.. versionadded:: 3.4.0 + +Returns +--- +bool +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +return self._session.is_local(query) + +@property +def isStreaming(self) -> bool: +"""Returns ``True`` if this :class:`DataFrame` contains one or more sources that +continuously return data as it arrives. A :class:`DataFrame` that reads data from a +streaming source must be executed as a :class:`StreamingQuery` using the :func:`start` +method in :class:`DataStreamWriter`. Methods that return a single answer, (e.g., +:func:`count` or :func:`collect`) will throw an :class:`AnalysisException` when there +is a streaming source present. + +.. versionadded:: 3.4.0 + +Notes +- +This API is evolving. + +Returns +--- +bool +Whether it's streaming DataFrame or not. +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +return self._session.is_streaming(query) + +def printSchema(self) -> None: +"""Prints out the schema in the tree format. + +.. versionadded:: 3.4.0 + +Returns +--- +None +""" +if self._plan is None: +raise Exception("Cannot analyze on empty plan.") +query = self._plan.to_proto(self._session) +print(self._session.tree_string(query)) + +def semanticHash(self) -> int: Review Comment: This is a developer API in Dataset, do we really need to provide it in Spark connect? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
dongjoon-hyun commented on code in PR #38567: URL: https://github.com/apache/spark/pull/38567#discussion_r1031012684 ## core/src/main/scala/org/apache/spark/status/AppStatusStore.scala: ## @@ -769,7 +772,12 @@ private[spark] object AppStatusStore { def createLiveStore( conf: SparkConf, appStatusSource: Option[AppStatusSource] = None): AppStatusStore = { -val store = new ElementTrackingStore(new InMemoryStore(), conf) +val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_)) +// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now, +// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with Review Comment: Got it. Thank you for confirming that.
[GitHub] [spark] zhengruifeng commented on pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify the typing definitions
zhengruifeng commented on PR #38757: URL: https://github.com/apache/spark/pull/38757#issuecomment-1325877024 merged into master
[GitHub] [spark] zhengruifeng closed pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify the typing definitions
zhengruifeng closed pull request #38757: [SPARK-41222][CONNECT][PYTHON] Unify the typing definitions URL: https://github.com/apache/spark/pull/38757
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1031003410 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { + // Get the schema + Schema schema = 1; + + // Is local + IsLocal is_local = 2; + + // Is Streaming + IsStreaming is_streaming = 3; + + // Get the explain string of the plan. + Explain explain = 4; + + // Get the tree string of the schema. + TreeString tree_string = 5; + + // Get the input files. + InputFiles input_files = 6; + + // Get the semantic hash + SemanticHash semantic_hash = 7; + + // Check whether plans are equal. + SameSemantics same_semantics = 8; +} + } + + // Analyze the input plan and return the schema. + message Schema { } + + // Returns true if the `collect` and `take` methods can be run locally. + message IsLocal { } + + // Returns true if this Dataset contains one or more sources that continuously + // return data as it arrives. + message IsStreaming { } + + // Explains the input plan based on a configurable mode. + message Explain { +// Plan explanation mode. +enum ExplainMode { + MODE_UNSPECIFIED = 0; + + // Generates only physical plan. + SIMPLE = 1; + + // Generates parsed logical plan, analyzed logical plan, optimized logical plan and physical plan. + // Parsed Logical plan is a unresolved plan that extracted from the query. Analyzed logical plans + // transforms which translates unresolvedAttribute and unresolvedRelation into fully typed objects. + // The optimized logical plan transforms through a set of optimization rules, resulting in the + // physical plan. + EXTENDED = 2; + + // Generates code for the statement, if any and a physical plan. 
+ CODEGEN = 3; + + // If plan node statistics are available, generates a logical plan and also the statistics. + COST = 4; + + // Generates a physical plan outline and also node details. + FORMATTED = 5; +} + +// (Required) For analyzePlan rpc calls, configure the mode to explain plan in strings. +ExplainMode explain_mode= 1; + } + + // Generate a string to express the schema in a nice tree format. + // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema') + // to compute the results. + message TreeString { + +// (Optional) The level to generate the string. Review Comment: will do
[GitHub] [spark] yabola commented on a diff in pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
yabola commented on code in PR #38560: URL: https://github.com/apache/spark/pull/38560#discussion_r1030998613 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -327,43 +327,52 @@ class BlockManagerMasterEndpoint( } }.toSeq -// Find all shuffle blocks on executors that are no longer running -val blocksToDeleteByShuffleService = - new mutable.HashMap[BlockManagerId, mutable.HashSet[BlockId]] +var removeShuffleFromShuffleServicesFutures = Seq.empty[Future[Boolean]] if (externalShuffleServiceRemoveShuffleEnabled) { - mapOutputTracker.shuffleStatuses.get(shuffleId).foreach { shuffleStatus => -shuffleStatus.withMapStatuses { mapStatuses => - mapStatuses.foreach { mapStatus => -// Check if the executor has been deallocated -if (!blockManagerIdByExecutor.contains(mapStatus.location.executorId)) { - val blocksToDel = - shuffleManager.shuffleBlockResolver.getBlocksForShuffle(shuffleId, mapStatus.mapId) - if (blocksToDel.nonEmpty) { -val blocks = blocksToDeleteByShuffleService.getOrElseUpdate(mapStatus.location, - new mutable.HashSet[BlockId]) -blocks ++= blocksToDel + val shuffleClient = externalBlockStoreClient.get + // Find all shuffle blocks on executors that are no longer running + val blocksToDelete = new mutable.HashMap[BlockManagerId, mutable.HashSet[BlockId]] + mapOutputTracker.shuffleStatuses.get(shuffleId) match { +case Some(shuffleStatus) => + shuffleStatus.withMapStatuses { mapStatuses => +mapStatuses.foreach { mapStatus => + // Check if the executor has been deallocated + if (!blockManagerIdByExecutor.contains(mapStatus.location.executorId)) { +val blocksToDel = shuffleManager.shuffleBlockResolver + .getBlocksForShuffle(shuffleId, mapStatus.mapId) +if (blocksToDel.nonEmpty) { + val blocks = blocksToDelete.getOrElseUpdate(mapStatus.location, +new mutable.HashSet[BlockId]) + blocks ++= blocksToDel +} Review Comment: > What if the shuffle statuses are not exists ? 
I think it will not, please see [case match codes](https://github.com/apache/spark/pull/38560/files#diff-21bb40987f7d21c8d3d3de3212f388c210fb04a636c96bdef53cab6070c6aa46R336)
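The refactored Scala above accumulates blocks to delete per `BlockManagerId` via `getOrElseUpdate`, skipping executors that are still registered. The same grouping idiom, sketched in Python with `dict.setdefault`; the executor and block names are invented for illustration:

```python
from typing import Dict, List, Set, Tuple

def group_blocks_to_delete(
    map_statuses: List[Tuple[str, List[str]]],
    live_executors: Set[str],
) -> Dict[str, Set[str]]:
    """For each map status (executor_id, shuffle blocks), collect the blocks
    whose executor is no longer running, keyed by location. This mirrors the
    blocksToDelete accumulation in the quoted diff."""
    blocks_to_delete: Dict[str, Set[str]] = {}
    for executor_id, blocks in map_statuses:
        # Only schedule deletion when the executor has been deallocated
        # and the resolver actually reported blocks for this map output.
        if executor_id not in live_executors and blocks:
            blocks_to_delete.setdefault(executor_id, set()).update(blocks)
    return blocks_to_delete
```

Grouping first and issuing one removal request per location (rather than one per block) is what lets the external shuffle service batch its deletes.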
[GitHub] [spark] yabola commented on a diff in pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
yabola commented on code in PR #38560: URL: https://github.com/apache/spark/pull/38560#discussion_r1023816561 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -654,8 +731,7 @@ public MergeStatuses finalizeShuffleMerge(FinalizeShuffleMerge msg) { // If no blocks pushed for the finalizeShuffleMerge shuffleMergeId then return // empty MergeStatuses but cleanup the older shuffleMergeId files. submitCleanupTask(() -> - closeAndDeleteOutdatedPartitions( - appAttemptShuffleMergeId, mergePartitionsInfo.shuffleMergePartitions)); + deleteCurrentShufflePartitions(appShuffleInfo, shuffleId, appAttemptShuffleMergeId)); Review Comment: I checked the `appAttemptShuffleMergeId` in the code before. I think if we want to delete the merged partition data, then we should also delete the corresponding ShuffleMergeId in the DB (otherwise, an inconsistency will occur when restoring shuffle info from the DB).
[GitHub] [spark] panbingkun commented on pull request #38710: [SPARK-41179][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1092
panbingkun commented on PR #38710: URL: https://github.com/apache/spark/pull/38710#issuecomment-1325855794 > @panbingkun Please, resolve conflicts. Done.
[GitHub] [spark] amaliujia commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/sem
amaliujia commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1030992346 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; + + message AnalysisTask { +oneof task { + // Get the schema + Schema schema = 1; + + // Is local + IsLocal is_local = 2; + + // Is Streaming + IsStreaming is_streaming = 3; + + // Get the explain string of the plan. + Explain explain = 4; + + // Get the tree string of the schema. + TreeString tree_string = 5; + + // Get the input files. + InputFiles input_files = 6; + + // Get the semantic hash + SemanticHash semantic_hash = 7; + + // Check whether plans are equal. + SameSemantics same_semantics = 8; +} + } + + // Analyze the input plan and return the schema. + message Schema { } + + // Returns true if the `collect` and `take` methods can be run locally. + message IsLocal { } + + // Returns true if this Dataset contains one or more sources that continuously + // return data as it arrives. + message IsStreaming { } + + // Explains the input plan based on a configurable mode. + message Explain { +// Plan explanation mode. +enum ExplainMode { + MODE_UNSPECIFIED = 0; + + // Generates only physical plan. + SIMPLE = 1; + + // Generates parsed logical plan, analyzed logical plan, optimized logical plan and physical plan. + // Parsed Logical plan is a unresolved plan that extracted from the query. Analyzed logical plans + // transforms which translates unresolvedAttribute and unresolvedRelation into fully typed objects. + // The optimized logical plan transforms through a set of optimization rules, resulting in the + // physical plan. + EXTENDED = 2; + + // Generates code for the statement, if any and a physical plan. 
+ CODEGEN = 3; + + // If plan node statistics are available, generates a logical plan and also the statistics. + COST = 4; + + // Generates a physical plan outline and also node details. + FORMATTED = 5; +} + +// (Required) For analyzePlan rpc calls, configure the mode to explain plan in strings. +ExplainMode explain_mode= 1; + } + + // Generate a string to express the schema in a nice tree format. + // It will invoke 'StructType.treeString' (same as 'Dataset.printSchema') + // to compute the results. + message TreeString { + +// (Optional) The level to generate the string. Review Comment: Document what is the default value?
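`TreeString.level` bounds how deep the printed schema tree goes, and the review comment asks what its absent-value default should be. A sketch of that truncation semantics on a toy nested schema; it assumes, as seems likely for `StructType.treeString`, that an absent level means "print everything", which is exactly the behavior the reviewer wants documented:

```python
from typing import List, Optional, Tuple

# A toy schema node: (name, children); a leaf has an empty child list.
Node = Tuple[str, List["Node"]]

def tree_string(node: Node, level: Optional[int] = None, depth: int = 0) -> List[str]:
    """Render the schema as indented lines, stopping below `level` when given.

    level=None is assumed here to mean no truncation (the presumed default).
    """
    if level is not None and depth >= level:
        return []
    name, children = node
    lines = ["  " * depth + f"|-- {name}"]
    for child in children:
        lines.extend(tree_string(child, level, depth + 1))
    return lines
```

With `level=2`, only the first two depths are emitted, matching the intent of an optional truncation parameter.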
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38742: [SPARK-41216][CONNECT][PYTHON] Make AnalyzePlan support multiple analysis tasks And implement isLocal/isStreaming/printSchema/
zhengruifeng commented on code in PR #38742: URL: https://github.com/apache/spark/pull/38742#discussion_r1030992308 ## connector/connect/src/main/protobuf/spark/connect/base.proto: ## @@ -100,18 +70,138 @@ message AnalyzePlanRequest { // logging purposes and will not be interpreted by the server. optional string client_type = 4; - // (Optional) Get the explain string of the plan. - Explain explain = 5; + repeated AnalysisTask tasks = 5; Review Comment: Supporting multiple analysis tasks is for this case: a user can get all attributes in a single RPC and then cache them for reuse.
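The point of the `repeated AnalysisTask tasks` field is exactly this caching use: issue one RPC carrying several tasks, then serve later property reads from the cached answers. A hypothetical client-side sketch; the task names follow the proto diff, but `analyze_stub` and its canned answers are invented stand-ins for the server:

```python
from typing import Any, Dict, Sequence

def analyze_stub(tasks: Sequence[str]) -> Dict[str, Any]:
    """Invented server stand-in: answer every requested task in one round trip."""
    answers = {"schema": "struct<id:bigint>", "is_local": True, "is_streaming": False}
    return {t: answers[t] for t in tasks}

class AnalysisCache:
    """Batch several analysis tasks into one call, then reuse the results."""

    def __init__(self) -> None:
        self._cache: Dict[str, Any] = {}
        self.rpc_calls = 0

    def get(self, task: str) -> Any:
        if not self._cache:
            # First access fetches everything the client may need later,
            # so isLocal, isStreaming and schema cost one RPC in total.
            self.rpc_calls += 1
            self._cache = analyze_stub(["schema", "is_local", "is_streaming"])
        return self._cache[task]
```

Whether the client eagerly requests all tasks or only the ones it expects to need is a policy choice; the proto merely makes batching possible.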
[GitHub] [spark] HeartSaVioR commented on pull request #38777: [SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent
HeartSaVioR commented on PR #38777: URL: https://github.com/apache/spark/pull/38777#issuecomment-1325844004 Ah OK, let's wait for feedback from @ala and make sure this is clear before merging it.
[GitHub] [spark] Yaohua628 commented on pull request #38777: [SPARK-41151][FOLLOW-UP][SQL] Keep built-in file _metadata fields nullable value consistent
Yaohua628 commented on PR #38777: URL: https://github.com/apache/spark/pull/38777#issuecomment-1325843243 Thank you, Jungtaek! Also want to confirm with @ala on the nullability of `row_index`
[GitHub] [spark] amaliujia commented on pull request #38770: [SPARK-41238][CONNECT][PYTHON] Support more datatypes
amaliujia commented on PR #38770: URL: https://github.com/apache/spark/pull/38770#issuecomment-1325840723 Thanks. Can you test both the `nullable=true` and `nullable=false` cases?
[GitHub] [spark] HyukjinKwon closed pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the name from injectPlanNormalizationRules to injectPlanNormalizationRule
HyukjinKwon closed pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the name from injectPlanNormalizationRules to injectPlanNormalizationRule URL: https://github.com/apache/spark/pull/38767
[GitHub] [spark] HyukjinKwon commented on pull request #38767: [SPARK-41183][SQL][FOLLOWUP] Change the name from injectPlanNormalizationRules to injectPlanNormalizationRule
HyukjinKwon commented on PR #38767: URL: https://github.com/apache/spark/pull/38767#issuecomment-1325839678 Merged to master.