[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1015075283

## python/pyspark/worker.py:

@@ -188,22 +241,7 @@ def wrapped(key_series, value_series):
         elif len(argspec.args) == 2:
             key = tuple(s[0] for s in key_series)
             result = f(key, pd.concat(value_series, axis=1))
-
-            if not isinstance(result, pd.DataFrame):
-                raise TypeError(
-                    "Return type of the user-defined function should be "
-                    "pandas.DataFrame, but is {}".format(type(result))
-                )
-            # the number of columns of result have to match the return type
-            # but it is fine for result to have no columns at all if it is empty
-            if not (
-                len(result.columns) == len(return_type) or len(result.columns) == 0 and result.empty
-            ):
-                raise RuntimeError(
-                    "Number of columns of the returned pandas.DataFrame "
-                    "doesn't match specified schema. "
-                    "Expected: {} Actual: {}".format(len(return_type), len(result.columns))
-                )
+            verify_pandas_result(result, return_type, assign_cols_by_name(runner_conf))

Review Comment: done

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
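The diff replaces the inline checks with a shared helper. A rough, standalone Python sketch of what a `verify_pandas_result`-style helper can look like (the error messages follow the diff above; the field-name matching, message abbreviation, and `assign_cols_by_name` handling in the real `python/pyspark/worker.py` are more involved, so treat this as an illustration only):

```python
import pandas as pd

def verify_pandas_result(result, field_names, assign_cols_by_name):
    # field_names stands in for the names of Spark's return_type (assumption).
    if not isinstance(result, pd.DataFrame):
        raise TypeError(
            "Return type of the user-defined function should be "
            "pandas.DataFrame, but is {}".format(type(result)))
    # An empty, column-less DataFrame is always acceptable.
    if len(result.columns) == 0 and result.empty:
        return
    # Match by name only when the UDF actually labeled its columns with strings.
    if assign_cols_by_name and all(isinstance(c, str) for c in result.columns):
        missing = sorted(set(field_names) - set(result.columns))
        unexpected = sorted(set(result.columns) - set(field_names))
        if missing or unexpected:
            raise RuntimeError(
                "Column names of the returned pandas.DataFrame do not match "
                "specified schema. Missing: {} Unexpected: {} Schema: {}".format(
                    ", ".join(missing), ", ".join(unexpected),
                    ", ".join(field_names)))
    # Otherwise fall back to matching by position, i.e. by column count.
    elif len(result.columns) != len(field_names):
        raise RuntimeError(
            "Number of columns of the returned pandas.DataFrame doesn't match "
            "specified schema. Expected: {} Actual: {}".format(
                len(field_names), len(result.columns)))
```

Pulling the checks into one helper lets both the grouped-map and cogrouped-map paths raise identical, testable error messages, which is what the test changes below rely on.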
[GitHub] [spark] EnricoMi commented on a diff in pull request #38223: [SPARK-40770][PYTHON] Improved error messages for applyInPandas for schema mismatch
EnricoMi commented on code in PR #38223: URL: https://github.com/apache/spark/pull/38223#discussion_r1015075103

## python/pyspark/sql/tests/pandas/test_pandas_cogrouped_map.py:

@@ -165,100 +148,191 @@ def merge_pandas(lft, _):
         )

     def test_apply_in_pandas_not_returning_pandas_dataframe(self):
-        left = self.data1
-        right = self.data2
+        self._test_merge_error(
+            fn=lambda lft, rgt: lft.size + rgt.size,
+            error_class=PythonException,
+            error_message_regex="Return type of the user-defined function "
+            "should be pandas.DataFrame, but is ",
+        )
+
+    def test_apply_in_pandas_returning_column_names(self):
+        self._test_merge(fn=lambda lft, rgt: pd.merge(lft, rgt, on=["id", "k"]))

+    def test_apply_in_pandas_returning_no_column_names(self):
         def merge_pandas(lft, rgt):
-            return lft.size + rgt.size
+            res = pd.merge(lft, rgt, on=["id", "k"])
+            res.columns = range(res.columns.size)
+            return res

-        with QuietTest(self.sc):
-            with self.assertRaisesRegex(
-                PythonException,
-                "Return type of the user-defined function should be pandas.DataFrame, "
-                "but is ",
-            ):
-                (
-                    left.groupby("id")
-                    .cogroup(right.groupby("id"))
-                    .applyInPandas(merge_pandas, "id long, k int, v int, v2 int")
-                    .collect()
-                )
+        self._test_merge(fn=merge_pandas)

-    def test_apply_in_pandas_returning_wrong_number_of_columns(self):
-        left = self.data1
-        right = self.data2
+    def test_apply_in_pandas_returning_column_names_sometimes(self):
+        def merge_pandas(lft, rgt):
+            res = pd.merge(lft, rgt, on=["id", "k"])
+            if 0 in lft["id"] and lft["id"][0] % 2 == 0:
+                return res
+            res.columns = range(res.columns.size)
+            return res
+
+        self._test_merge(fn=merge_pandas)

+    def test_apply_in_pandas_returning_wrong_column_names(self):
         def merge_pandas(lft, rgt):
             if 0 in lft["id"] and lft["id"][0] % 2 == 0:
                 lft["add"] = 0
             if 0 in rgt["id"] and rgt["id"][0] % 3 == 0:
                 rgt["more"] = 1
             return pd.merge(lft, rgt, on=["id", "k"])

-        with QuietTest(self.sc):
-            with self.assertRaisesRegex(
-                PythonException,
-                "Number of columns of the returned pandas.DataFrame "
-                "doesn't match specified schema. Expected: 4 Actual: 6",
-            ):
-                (
-                    # merge_pandas returns two columns for even keys while we set schema to four
-                    left.groupby("id")
-                    .cogroup(right.groupby("id"))
-                    .applyInPandas(merge_pandas, "id long, k int, v int, v2 int")
-                    .collect()
-                )
+        self._test_merge_error(
+            fn=merge_pandas,
+            error_class=PythonException,
+            error_message_regex="Column names of the returned pandas.DataFrame "
+            "do not match specified schema. Unexpected: add, more Schema: id, k, v, v2\n",
+        )

-    def test_apply_in_pandas_returning_empty_dataframe(self):
-        left = self.data1
-        right = self.data2
+        # with very large schema, missing and unexpected is limited to 5
+        # and the schema is abbreviated in the error message
+        schema = "id long, k long, mean double, " + ", ".join(
+            f"column_with_long_column_name_{no} integer" for no in range(35)
+        )
+        self._test_merge_error(
+            fn=lambda lft, rgt: pd.DataFrame(
+                [
+                    (
+                        lft.id,
+                        lft.k,
+                        lft.v.mean(),
+                    )
+                    + tuple(lft.v.mean() for _ in range(7))
+                ],
+                columns=["id", "k", "mean"] + [f"extra_column_{no} integer" for no in range(7)],
+            ),
+            output_schema=schema,
+            error_class=PythonException,
+            error_message_regex="Column names of the returned pandas\\.DataFrame "
+            "do not match specified schema\\. "
+            "Missing \\(first 5 of 35\\): column_with_long_column_name_0,"
+            " column_with_long_column_name_1, column_with_long_column_name_10,"
+            " column_with_long_column_name_11, column_with_long_column_name_12 "
+            "Unexpected \\(first 5 of 7\\): extra_column_0 integer, extra_column_1 integer,"
+            " extra_column_2 integer, extra_column_3 integer, extra_column_4 integer "
+            "Schema: id, k, mean, column_with_long_column_name_0, column_with_long_column_name_1,"
+            " column_with_long_column_name_2,
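The named-versus-positional cases in the tests above hinge on plain pandas behavior: `pd.merge` keeps the column labels, and assigning a `range` to `columns` wipes them out, leaving a positional index. A quick standalone illustration, with no Spark involved:

```python
import pandas as pd

lft = pd.DataFrame({"id": [1, 1], "k": [1, 2], "v": [10, 20]})
rgt = pd.DataFrame({"id": [1, 1], "k": [1, 2], "v2": [30, 40]})

# Merging on the grouping keys yields the four schema columns, by name.
res = pd.merge(lft, rgt, on=["id", "k"])
print(list(res.columns))  # ['id', 'k', 'v', 'v2']

# Replacing the labels with a positional index is what the
# "returning_no_column_names" test does; the worker then has to match
# columns against the schema by position instead of by name.
res.columns = range(res.columns.size)
print(list(res.columns))  # [0, 1, 2, 3]
```

This is why the "sometimes" test is interesting: the same UDF can hand back named columns for some groups and positional columns for others, and both must be accepted.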
[GitHub] [spark] EnricoMi commented on pull request #38509: [SPARK-41014][PYTHON][DOC] Improve documentation and typing of groupby and cogroup applyInPandas
EnricoMi commented on PR #38509: URL: https://github.com/apache/spark/pull/38509#issuecomment-1305160983 @HyukjinKwon @zhengruifeng these are minor improvements to the documentation and typing of the PySpark applyInPandas methods.
[GitHub] [spark] EnricoMi commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
EnricoMi commented on PR #38312: URL: https://github.com/apache/spark/pull/38312#issuecomment-1305155414 @cloud-fan where do we stand with this? Is this a regression? How do we proceed?
[GitHub] [spark] LuciferYang commented on pull request #38519: [MINOR][SQL] Remove unused an error class and query error methods
LuciferYang commented on PR #38519: URL: https://github.com/apache/spark/pull/38519#issuecomment-1305147588 late LGTM
[GitHub] [spark] MaxGekk closed pull request #38519: [MINOR][SQL] Remove unused an error class and query error methods
MaxGekk closed pull request #38519: [MINOR][SQL] Remove unused an error class and query error methods URL: https://github.com/apache/spark/pull/38519
[GitHub] [spark] MaxGekk commented on pull request #38519: [MINOR][SQL] Remove unused an error class and query error methods
MaxGekk commented on PR #38519: URL: https://github.com/apache/spark/pull/38519#issuecomment-1305144565 Merging to master. Thank you, @itholic for review.
[GitHub] [spark] panbingkun opened a new pull request, #38531: [SPARK-40755][SQL] Migrate type check failures of number formatting onto error classes
panbingkun opened a new pull request, #38531: URL: https://github.com/apache/spark/pull/38531

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
zhengruifeng commented on code in PR #38318: URL: https://github.com/apache/spark/pull/38318#discussion_r1015041619

## python/pyspark/sql/connect/dataframe.py:

@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->

     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)

+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)

Review Comment: different from https://github.com/apache/spark/blob/29e45528315c43336ba60ee1717ef5ec0c001c00/python/pyspark/sql/dataframe.py#L2575-L2576 since I think that preprocessing is weird
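For context, the preprocessing being referenced boils down to substituting a default statistics list when the caller passes none. A hedged sketch (the helper name is made up; the defaults mirror those documented for `DataFrame.summary`):

```python
from typing import List

def resolve_statistics(*statistics: str) -> List[str]:
    # DataFrame.summary() documents these defaults when no statistics
    # are requested; otherwise the caller's list is used verbatim.
    if not statistics:
        return ["count", "mean", "stddev", "min", "25%", "50%", "75%", "max"]
    return list(statistics)

print(resolve_statistics("count", "max"))  # ['count', 'max']
```

The debate above is only about where this substitution happens (client-side preprocessing versus letting the server-side plan fill in the defaults), not about the defaults themselves.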
[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
zhengruifeng commented on PR #38318: URL: https://github.com/apache/spark/pull/38318#issuecomment-1305136349 cc @cloud-fan @HyukjinKwon @amaliujia
[GitHub] [spark] cloud-fan commented on pull request #38513: [SPARK-40903][SQL][FOLLOWUP] Cast canonicalized Add as its original data type if necessary
cloud-fan commented on PR #38513: URL: https://github.com/apache/spark/pull/38513#issuecomment-1305100077 I think canonicalization should not change the data type in the first place. Adding a cast only hides the bug. What's worse, it doesn't help with the goal of canonicalization (matching plans/expressions that are semantically equal) because of the extra cast. Can we be stricter about when we reorder? E.g. add an allowlist and only reorder in certain cases, such as integral add with ANSI off.
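The allowlist idea can be sketched outside Spark: reorder the operands of a commutative chain only when the data type guarantees the result type (and error behavior) cannot change. Everything below, the type names and the predicate alike, is illustrative rather than Spark's actual API:

```python
# Types for which a + b + c keeps the same result type regardless of
# association order (assumed set, for illustration).
INTEGRAL_TYPES = {"byte", "short", "int", "long"}

def can_reorder(data_type: str, ansi_mode: bool) -> bool:
    # Integral add with ANSI off wraps silently and keeps its type, so a
    # canonical reordering is safe. Decimal add derives its precision and
    # scale from the operand types, so reordering can change the result
    # type; ANSI mode can additionally turn overflow into an error.
    return data_type in INTEGRAL_TYPES and not ansi_mode

def canonicalize_add(operands, data_type, ansi_mode):
    # Sort into a canonical operand order only when it is provably safe.
    return sorted(operands) if can_reorder(data_type, ansi_mode) else list(operands)

print(canonicalize_add(["c", "a", "b"], "int", ansi_mode=False))  # ['a', 'b', 'c']
```

Under such a guard, decimal `Add` would simply never be reordered, so no compensating `Cast` would be needed in the canonicalized plan.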
[GitHub] [spark] LuciferYang commented on pull request #38530: [SPARK-41027][SQL] Use `UNEXPECTED_INPUT_TYPE` instead of `MAP_FROM_ENTRIES_WRONG_TYPE`
LuciferYang commented on PR #38530: URL: https://github.com/apache/spark/pull/38530#issuecomment-1305095279 Yes, if the exception test lacks the corresponding UT, I will add one
[GitHub] [spark] cxzl25 commented on pull request #38489: [SPARK-41003][SQL] BHJ LeftAnti does not update numOutputRows when codegen is disabled
cxzl25 commented on PR #38489: URL: https://github.com/apache/spark/pull/38489#issuecomment-1305078019 Could you help review? Thanks. @leanken-zz @cloud-fan @wangyum
[GitHub] [spark] ulysses-you commented on a diff in pull request #38513: [SPARK-40903][SQL][FOLLOWUP] Cast canonicalized Add as its original data type if necessary
ulysses-you commented on code in PR #38513: URL: https://github.com/apache/spark/pull/38513#discussion_r1015003195

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:

@@ -481,12 +481,12 @@ case class Add(
     // TODO: do not reorder consecutive `Add`s with different `evalMode`
     val reorderResult =
       orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
-    if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {
-      reorderResult
+    if (resolved && reorderResult.resolved && reorderResult.dataType != dataType) {

Review Comment: I'm not worried about new data types; I just want to avoid unnecessary logic for unrelated data types like integer and float.
[GitHub] [spark] gengliangwang commented on a diff in pull request #38513: [SPARK-40903][SQL][FOLLOWUP] Cast canonicalized Add as its original data type if necessary
gengliangwang commented on code in PR #38513: URL: https://github.com/apache/spark/pull/38513#discussion_r101535

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:

@@ -481,12 +481,12 @@ case class Add(
     // TODO: do not reorder consecutive `Add`s with different `evalMode`
     val reorderResult =
       orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
-    if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {
-      reorderResult
+    if (resolved && reorderResult.resolved && reorderResult.dataType != dataType) {

Review Comment: @ulysses-you the current code seems fine. If there is a new data type in the future, we can still avoid the issue.
[GitHub] [spark] gengliangwang commented on a diff in pull request #38513: [SPARK-40903][SQL][FOLLOWUP] Cast canonicalized Add as its original data type if necessary
gengliangwang commented on code in PR #38513: URL: https://github.com/apache/spark/pull/38513#discussion_r1014999188

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:

@@ -481,12 +481,12 @@ case class Add(
     // TODO: do not reorder consecutive `Add`s with different `evalMode`
     val reorderResult =
       orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
-    if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {
-      reorderResult
+    if (resolved && reorderResult.resolved && reorderResult.dataType != dataType) {
+      // SPARK-40903: Append cast for the canonicalization of decimal Add if the result data type is
+      // changed. Otherwise, it may cause data checking error within ComplexTypeMergingExpression.
+      Cast(reorderResult, dataType)

Review Comment: Yes, the cast is better on second thought.
[GitHub] [spark] itholic commented on a diff in pull request #38447: [SPARK-40973][SQL] Rename `_LEGACY_ERROR_TEMP_0055` to `UNCLOSED_BRACKETED_COMMENT`
itholic commented on code in PR #38447: URL: https://github.com/apache/spark/pull/38447#discussion_r1014998869

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala:

@@ -608,8 +608,12 @@
   def unclosedBracketedCommentError(command: String, position: Origin): Throwable = {
-    new ParseException(Some(command), "Unclosed bracketed comment", position, position,
-      Some("_LEGACY_ERROR_TEMP_0055"))
+    new ParseException(
+      Some(command),
+      "Found an unclosed bracketed comment. Please, append */ at the end of the comment.",
+      position,
+      position,
+      Some("UNCLOSED_BRACKETED_COMMENT"))
   }

Review Comment: Cool! Thanks for the suggestion 👍
[GitHub] [spark] itholic commented on pull request #38422: [SPARK-40948][SQL] Introduce new error class: PATH_NOT_FOUND
itholic commented on PR #38422: URL: https://github.com/apache/spark/pull/38422#issuecomment-1305067447 Thanks, @MaxGekk ! Just adjusted the comments.
[GitHub] [spark] itholic commented on pull request #38530: [SPARK-41027][SQL] Use `UNEXPECTED_INPUT_TYPE` instead of `MAP_FROM_ENTRIES_WRONG_TYPE`
itholic commented on PR #38530: URL: https://github.com/apache/spark/pull/38530#issuecomment-1305066532 It would be great to put an example into the PR description showing how the error message changes, with `Before` and `After`.
[GitHub] [spark] LuciferYang commented on pull request #38530: [SPARK-41027][SQL] Use `UNEXPECTED_INPUT_TYPE ` instead of `MAP_FROM_ENTRIES_WRONG_TYPE`
LuciferYang commented on PR #38530: URL: https://github.com/apache/spark/pull/38530#issuecomment-1305030163 cc @MaxGekk
[GitHub] [spark] LuciferYang commented on a diff in pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on code in PR #38507: URL: https://github.com/apache/spark/pull/38507#discussion_r1014970870

## core/src/main/resources/error/error-classes.json:

@@ -225,14 +230,14 @@
       "The <functionName> should all be of type map, but it's <dataType>."
     ]
   },
-  "MAP_CONTAINS_KEY_DIFF_TYPES" : {
+  "MAP_FROM_ENTRIES_WRONG_TYPE" : {

Review Comment: https://github.com/apache/spark/pull/38530
[GitHub] [spark] LuciferYang commented on a diff in pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on code in PR #38507: URL: https://github.com/apache/spark/pull/38507#discussion_r1014970870

## core/src/main/resources/error/error-classes.json:

@@ -225,14 +230,14 @@
       "The <functionName> should all be of type map, but it's <dataType>."
     ]
   },
-  "MAP_CONTAINS_KEY_DIFF_TYPES" : {
+  "MAP_FROM_ENTRIES_WRONG_TYPE" : {

Review Comment: https://github.com/apache/spark/pull/38507
[GitHub] [spark] LuciferYang opened a new pull request, #38530: [SPARK-41027][SQL] Use `UNEXPECTED_INPUT_TYPE ` instead of `MAP_FROM_ENTRIES_WRONG_TYPE`
LuciferYang opened a new pull request, #38530: URL: https://github.com/apache/spark/pull/38530

### What changes were proposed in this pull request?

This PR aims to use `UNEXPECTED_INPUT_TYPE` instead of `MAP_FROM_ENTRIES_WRONG_TYPE` and remove the definition of `MAP_FROM_ENTRIES_WRONG_TYPE`.

### Why are the changes needed?

Re-use `UNEXPECTED_INPUT_TYPE`.

### Does this PR introduce _any_ user-facing change?

Yes. The PR changes user-facing error messages.

### How was this patch tested?

Pass GA
[GitHub] [spark] LuciferYang commented on a diff in pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on code in PR #38507: URL: https://github.com/apache/spark/pull/38507#discussion_r1014968186

## core/src/main/resources/error/error-classes.json:

@@ -225,14 +230,14 @@
       "The <functionName> should all be of type map, but it's <dataType>."
     ]
   },
-  "MAP_CONTAINS_KEY_DIFF_TYPES" : {
+  "MAP_FROM_ENTRIES_WRONG_TYPE" : {

Review Comment: SPARK-41027. But if we use `UNEXPECTED_INPUT_TYPE`, the `requiredType` should be `Array of pair Struct`:

```
"requiredType" -> s"${toSQLType(ArrayType)} of pair ${toSQLType(StructType)}"
```

Is that OK?
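The suggested `requiredType` string is easy to preview without Spark. `to_sql_type` below is a stand-in for Spark's `toSQLType` helper, which renders a type as a quoted SQL name; only the general shape is claimed here, not the exact formatting Spark uses:

```python
def to_sql_type(name: str) -> str:
    # Rough stand-in for Spark's toSQLType: upper-cased, double-quoted.
    return '"' + name.upper() + '"'

# The string proposed in the review comment above:
required_type = "{} of pair {}".format(to_sql_type("array"), to_sql_type("struct"))
print(required_type)  # "ARRAY" of pair "STRUCT"
```

This is the value that would fill the `<requiredType>` placeholder of the generic `UNEXPECTED_INPUT_TYPE` message, replacing the dedicated `MAP_FROM_ENTRIES_WRONG_TYPE` class.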
[GitHub] [spark] jackylee-ch commented on a diff in pull request #38496: [WIP][SPARK-40708][SQL] Auto update table statistics based on write metrics
jackylee-ch commented on code in PR #38496: URL: https://github.com/apache/spark/pull/38496#discussion_r1014966247

## sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala:

@@ -53,19 +53,41 @@ class PathFilterIgnoreNonData(stagingDir: String) extends PathFilter with Serial
 object CommandUtils extends Logging {

   /** Change statistics after changing data by commands. */
-  def updateTableStats(sparkSession: SparkSession, table: CatalogTable): Unit = {
-    val catalog = sparkSession.sessionState.catalog
-    if (sparkSession.sessionState.conf.autoSizeUpdateEnabled) {
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      wroteStats: Option[WriteStats] = None): Unit = {
+    val sessionState = sparkSession.sessionState
+    val catalog = sessionState.catalog
+    if (sessionState.conf.autoSizeUpdateEnabled) {
       val newTable = catalog.getTableMetadata(table.identifier)
       val (newSize, newPartitions) = CommandUtils.calculateTotalSize(sparkSession, newTable)
-      val isNewStats = newTable.stats.map(newSize != _.sizeInBytes).getOrElse(true)
+      val isNewStats = newTable.stats.forall(newSize != _.sizeInBytes)
       if (isNewStats) {
         val newStats = CatalogStatistics(sizeInBytes = newSize)
         catalog.alterTableStats(table.identifier, Some(newStats))
       }
       if (newPartitions.nonEmpty) {
         catalog.alterPartitions(table.identifier, newPartitions)
       }
+    } else if (sessionState.conf.autoUpdateBasedOnMetricsEnabled && wroteStats.nonEmpty) {
+      val stat = wroteStats.get
+      stat.mode match {
+        case SaveMode.Overwrite | SaveMode.ErrorIfExists =>
+          catalog.alterTableStats(table.identifier,
+            Some(CatalogStatistics(stat.numBytes, stat.numRows)))

Review Comment: Hm, we should consider partition statistics here. If we overwrite only part of the partitions, we would get wrong table statistics.
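The review concern, that overwriting only some partitions replaces the table-level numbers wholesale, can be seen in a stripped-down model of the update rule. The names and modes below are illustrative and not the PR's actual API:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WriteStats:
    mode: str       # "overwrite", "error_if_exists", or "append" (assumed labels)
    num_bytes: int
    num_rows: int

def next_table_stats(current: Optional[Tuple[int, int]],
                     wrote: Optional[WriteStats]) -> Optional[Tuple[int, int]]:
    """Return the new (size_in_bytes, row_count) after a write, or None."""
    if wrote is None:
        return current
    if wrote.mode in ("overwrite", "error_if_exists"):
        # Replacing the whole table: the write metrics describe the new
        # contents. If only *some* partitions were overwritten, this
        # wholesale replacement is exactly where table-level stats go wrong,
        # which is the point raised in the review comment.
        return (wrote.num_bytes, wrote.num_rows)
    if wrote.mode == "append" and current is not None:
        # Appending: metrics are a delta on top of a known baseline.
        return (current[0] + wrote.num_bytes, current[1] + wrote.num_rows)
    return None  # unknown baseline: drop the stats and force a recompute

print(next_table_stats((100, 10), WriteStats("append", 50, 5)))  # (150, 15)
```

A partition-aware version would have to key the deltas by partition spec rather than fold everything into one table-level pair.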
[GitHub] [spark] ulysses-you commented on a diff in pull request #38513: [SPARK-40903][SQL][FOLLOWUP] Cast canonicalized Add as its original data type if necessary
ulysses-you commented on code in PR #38513: URL: https://github.com/apache/spark/pull/38513#discussion_r1014959493

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala:

@@ -481,12 +481,12 @@ case class Add(
     // TODO: do not reorder consecutive `Add`s with different `evalMode`
     val reorderResult =
       orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
-    if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {
-      reorderResult
+    if (resolved && reorderResult.resolved && reorderResult.dataType != dataType) {

Review Comment: To be safe, can we do this for decimal type only? e.g.

```scala
left match {
  case _: DecimalType
      if resolved && reorderResult.resolved && reorderResult.dataType != dataType =>
    Cast(reorderResult, dataType)
  case _ => reorderResult
}
```
[GitHub] [spark] LuciferYang commented on a diff in pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on code in PR #38507: URL: https://github.com/apache/spark/pull/38507#discussion_r1014954543

## core/src/main/resources/error/error-classes.json:

@@ -225,14 +230,14 @@
       "The <functionName> should all be of type map, but it's <dataType>."
     ]
   },
-  "MAP_CONTAINS_KEY_DIFF_TYPES" : {
+  "MAP_FROM_ENTRIES_WRONG_TYPE" : {

Review Comment: @MaxGekk Are you talking about `MAP_FROM_ENTRIES_WRONG_TYPE`? This is not a new subclass introduced by this PR, but one created by https://github.com/apache/spark/pull/38197/files. Let me file a new Jira to replace `MAP_FROM_ENTRIES_WRONG_TYPE` with `UNEXPECTED_INPUT_TYPE`, because it is not related to this PR.
[GitHub] [spark] zhengruifeng commented on pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on PR #38395: URL: https://github.com/apache/spark/pull/38395#issuecomment-1304998340 I am going to close this PR and continue the work in https://github.com/apache/spark/pull/38318
[GitHub] [spark] zhengruifeng closed pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng closed pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary` URL: https://github.com/apache/spark/pull/38395
[GitHub] [spark] beliefer commented on pull request #37630: [SPARK-40193][SQL] Merge subquery plans with different filters
beliefer commented on PR #37630: URL: https://github.com/apache/spark/pull/37630#issuecomment-1304985806 @peter-toth Could you fix these conflicts? I want to test this PR. Thank you!
[GitHub] [spark] github-actions[bot] commented on pull request #36253: [SPARK-38932][SQL] Datasource v2 support report distinct keys
github-actions[bot] commented on PR #36253: URL: https://github.com/apache/spark/pull/36253#issuecomment-1304942022 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #37065: [SPARK-38699][SQL] Use error classes in the execution errors of dictionary encoding
github-actions[bot] closed pull request #37065: [SPARK-38699][SQL] Use error classes in the execution errors of dictionary encoding URL: https://github.com/apache/spark/pull/37065
[GitHub] [spark] github-actions[bot] closed pull request #37317: [SPARK-39894][SQL] Combine the similar binary comparison in boolean expression.
github-actions[bot] closed pull request #37317: [SPARK-39894][SQL] Combine the similar binary comparison in boolean expression. URL: https://github.com/apache/spark/pull/37317
[GitHub] [spark] github-actions[bot] closed pull request #37235: [SPARK-39824][PYTHON][PS] Introduce index where and putmask func in pyspark
github-actions[bot] closed pull request #37235: [SPARK-39824][PYTHON][PS] Introduce index where and putmask func in pyspark URL: https://github.com/apache/spark/pull/37235
[GitHub] [spark] github-actions[bot] commented on pull request #37316: [SPARK-39893][SQL] Push limit 1 to the aggregate's child plan if grouping expressions and aggregate expressions are foldable
github-actions[bot] commented on PR #37316: URL: https://github.com/apache/spark/pull/37316#issuecomment-1304941982 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] srowen commented on pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
srowen commented on PR #38510: URL: https://github.com/apache/spark/pull/38510#issuecomment-1304934181 Merged to master
[GitHub] [spark] srowen closed pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
srowen closed pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar URL: https://github.com/apache/spark/pull/38510
[GitHub] [spark] LuciferYang commented on pull request #38507: [SPARK-40372][SQL] Migrate failures of array type checks onto error classes
LuciferYang commented on PR #38507: URL: https://github.com/apache/spark/pull/38507#issuecomment-1304929641 will update this one today
[GitHub] [spark] LuciferYang commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
LuciferYang commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304926628 > Shouldn't guava changes be done in a separate PR? Currently, I am not considering upgrading Guava, for the following reasons: 1. Although Spark 3.4.0 will no longer release the hadoop-2 distribution, its build and testing process is still running; we need to keep it, and should not upgrade Guava until it is really removed. 2. As far as I know, Hive 2.3 still depends on Guava 14.0.1; @sunchao has tried to solve this problem before, but it does not appear to be finished, and I don't know whether there are any follow-up plans.
[GitHub] [spark] LuciferYang commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
LuciferYang commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304926354 > @LuciferYang feel free to continue with your PR. I marked [SPARK-41023](https://issues.apache.org/jira/browse/SPARK-41023) as a duplicate of [SPARK-40911](https://issues.apache.org/jira/browse/SPARK-40911). Thanks ~
[GitHub] [spark] LuciferYang commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
LuciferYang commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304922578 @srowen test passed
[GitHub] [spark] amaliujia commented on a diff in pull request #38488: [SPARK-41002][CONNECT][PYTHON] Compatible `take`, `head` and `first` API in Python client
amaliujia commented on code in PR #38488: URL: https://github.com/apache/spark/pull/38488#discussion_r1013499840 ## python/pyspark/sql/connect/dataframe.py: ## @@ -211,14 +212,66 @@ def filter(self, condition: Expression) -> "DataFrame": plan.Filter(child=self._plan, filter=condition), session=self._session ) -def first(self) -> Optional["pandas.DataFrame"]: -return self.head(1) +def first(self) -> Optional[Row]: +"""Returns the first row as a :class:`Row`. + +.. versionadded:: 3.4.0 + +Returns +--- +:class:`Row` + First row if :class:`DataFrame` is not empty, otherwise ``None``. +""" +return self.head() Review Comment: Ah, actually we cannot: `self.head()` returns `Optional[Row]` but `self.head(n)` returns `List[Row]`. Calling `self.head()` with no argument is what makes the mypy check pass.
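The signature asymmetry described above can be expressed with `typing.overload`; the sketch below is illustrative only (the `Row` and `DataFrame` classes are simplified stand-ins, not the actual Connect client code):

```python
from typing import List, Optional, overload


class Row(tuple):
    """Simplified stand-in for pyspark.sql.Row."""


class DataFrame:
    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    @overload
    def head(self) -> Optional[Row]: ...

    @overload
    def head(self, n: int) -> List[Row]: ...

    def head(self, n: Optional[int] = None):
        # head() -> first row or None; head(n) -> list of up to n rows.
        if n is None:
            return self._rows[0] if self._rows else None
        return self._rows[:n]

    def first(self) -> Optional[Row]:
        # Must call head() with no argument: head(1) would be typed as
        # List[Row], which mypy rejects for an Optional[Row] return type.
        return self.head()
```

With this setup, `first()` type-checks cleanly, while a hypothetical `return self.head(1)` would be flagged by mypy.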
[GitHub] [spark] amaliujia commented on pull request #38529: [SPARK-41026][CONNECT] Support Repartition in Connect Proto
amaliujia commented on PR #38529: URL: https://github.com/apache/spark/pull/38529#issuecomment-1304913709 R: @cloud-fan
[GitHub] [spark] amaliujia opened a new pull request, #38529: [SPARK-41026][CONNECT] Support Repartition in Connect Proto
amaliujia opened a new pull request, #38529: URL: https://github.com/apache/spark/pull/38529 ### What changes were proposed in this pull request? Support `Repartition` in the Connect proto, which covers two APIs: `repartition` (shuffle=true) and `coalesce` (shuffle=false). ### Why are the changes needed? Improve API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT
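As a rough illustration of how one relation can cover both APIs through a shuffle flag (the field and function names below are hypothetical, not the actual Connect proto definition):

```python
from dataclasses import dataclass


@dataclass
class Repartition:
    """Illustrative stand-in for a single Connect proto relation."""
    num_partitions: int
    shuffle: bool


def repartition(n: int) -> Repartition:
    # DataFrame.repartition(n) always performs a shuffle.
    return Repartition(num_partitions=n, shuffle=True)


def coalesce(n: int) -> Repartition:
    # DataFrame.coalesce(n) only merges existing partitions, so no shuffle.
    return Repartition(num_partitions=n, shuffle=False)
```

The design choice mirrors the Scala side, where `coalesce` is effectively `repartition` with shuffling disabled.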
[GitHub] [spark] HeartSaVioR commented on pull request #38528: [SPARK-41025][SS] Introduce ComparableOffset to support offset range validation
HeartSaVioR commented on PR #38528: URL: https://github.com/apache/spark/pull/38528#issuecomment-1304913137 cc. @zsxwing @jerrypeng Appreciate your review. Thanks in advance!
[GitHub] [spark] HeartSaVioR opened a new pull request, #38528: [SPARK-41025][SS] Introduce ComparableOffset to support offset range validation
HeartSaVioR opened a new pull request, #38528: URL: https://github.com/apache/spark/pull/38528 ### What changes were proposed in this pull request? This PR proposes to introduce a new interface, ComparableOffset, a mix-in for the streaming Offset interface that enables comparison between two offset instances. MicroBatchExecution will validate the offset range if the offset instance implements ComparableOffset. The new interface can be mixed in with both the DSv1 and the DSv2 streaming Offset. This PR also implements the interface for the streaming offsets of the built-in data sources. ### Why are the changes needed? Currently, Spark does not perform any assertion against offsets, and data source implementations bear full responsibility for validating them. It seems more useful for Spark to provide the offset validation than to merely document the responsibility and leave the duty to each data source implementation. This validation matters even more with Trigger.AvailableNow, which gradually increases the offset and terminates when the offset equals the desired offset. A bug in a data source may stall query progress or even cause data duplication. ### Does this PR introduce _any_ user-facing change? No for end users. ### How was this patch tested? New UTs
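The idea can be sketched as follows; this is an illustrative Python sketch, not the proposed Scala interface, and names like `compare_to` and `validate_offset_range` are assumptions based on the description above:

```python
from dataclasses import dataclass


@dataclass
class LongOffset:
    """Toy totally-ordered offset, mimicking a ComparableOffset mix-in."""
    value: int

    def compare_to(self, other: "LongOffset") -> int:
        # Negative / zero / positive, like Java's Comparable.compareTo.
        return (self.value > other.value) - (self.value < other.value)


def validate_offset_range(start: LongOffset, end: LongOffset) -> None:
    # The engine-side check the micro-batch executor could perform when
    # the source's offsets are comparable: a range must never go backwards.
    if end.compare_to(start) < 0:
        raise AssertionError(
            f"invalid offset range: end {end.value} < start {start.value}"
        )
```

A data source bug that hands back a decreasing offset would then fail fast in the engine instead of silently stalling progress or duplicating data.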
[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
ljfgem commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014902247 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: Yeah, I am fine with keeping `V2ViewDescription`; I don't think the change is necessary.
[GitHub] [spark] jzhuge commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
jzhuge commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014901610 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: Indeed, `V2ViewDescription` is an API adaptor for easy cloning of v1 SQL commands, and adding missing APIs like `comment` or `owner` (stored in properties on disk), etc. So keep using `V2ViewDescription`?
[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
ljfgem commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014899090 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: Thanks for explaining. Do you think we also need to modify `ShowViewPropertiesExec` and `DescribeViewExec`, which use `V2ViewDescription`? It looks like using `V2ViewDescription` would bring some convenience, especially for `DescribeViewExec`. I am good with both approaches.
[GitHub] [spark] jzhuge commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
jzhuge commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014897627 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: > I found output parameter in ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription) is never used? It is needed by the abstract class `QueryPlan`.
[GitHub] [spark] jzhuge commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
jzhuge commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014897367 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: `ShowCreateV2View` is for v2 views, while `V2ViewDescription` is an adaptor between v1 and v2, so it makes more sense for `ShowCreateV2View` not to use `V2ViewDescription`, instead relying solely on V2 interfaces.
[GitHub] [spark] sunchao commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
sunchao commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1304899326 I'm going to start working on it this week.
[GitHub] [spark] AmplabJenkins commented on pull request #38527: [SPARK-40875][CONNECT] Improve aggregate in Connect DSL
AmplabJenkins commented on PR #38527: URL: https://github.com/apache/spark/pull/38527#issuecomment-1304898323 Can one of the admins verify this patch?
[GitHub] [spark] amaliujia commented on pull request #38527: [SPARK-40875][CONNECT] Improve aggregate in Connect DSL
amaliujia commented on PR #38527: URL: https://github.com/apache/spark/pull/38527#issuecomment-1304893285 R: @cloud-fan
[GitHub] [spark] amaliujia opened a new pull request, #38527: [SPARK-40875][CONNECT] Improve aggregate in Connect DSL
amaliujia opened a new pull request, #38527: URL: https://github.com/apache/spark/pull/38527

### What changes were proposed in this pull request?
This PR adds the aggregate expressions (or named result expressions) for Aggregate in Connect proto and DSL. On the server side, this PR also differentiates named expressions (e.g. with `alias`) from non-named expressions (the server wraps the latter in `UnresolvedAlias`, and Catalyst generates an alias for such an expression).

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT
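The named-vs-non-named distinction described above can be sketched as a toy model. All names here (`Expr`, `Alias`, `UnresolvedAlias`, `ensure_named`) are hypothetical illustrations of the idea, not actual Spark Connect server code:

```python
from dataclasses import dataclass

@dataclass
class Expr:
    # A plain, unnamed expression (e.g. a column reference or aggregate call)
    name: str

@dataclass
class Alias:
    # An expression the user explicitly named via alias
    child: Expr
    alias: str

@dataclass
class UnresolvedAlias:
    # Wrapper signalling "the analyzer should generate an alias for this"
    child: Expr

def ensure_named(expr):
    # Explicitly aliased expressions pass through unchanged; everything
    # else is wrapped so an alias can be generated later.
    return expr if isinstance(expr, Alias) else UnresolvedAlias(expr)
```

Under this sketch, `ensure_named(Alias(Expr("v"), "total"))` is returned as-is, while `ensure_named(Expr("v"))` comes back wrapped in `UnresolvedAlias`.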
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38515: [SPARK-41015][SQL][PROTOBUF] UnitTest null check for data generator
SandishKumarHN commented on code in PR #38515: URL: https://github.com/apache/spark/pull/38515#discussion_r1014891059 ## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ## @@ -3344,7 +3344,7 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase { def failedParsingDescriptorError(descFilePath: String, cause: Throwable): Throwable = { new AnalysisException( errorClass = "CANNOT_CONSTRUCT_PROTOBUF_DESCRIPTOR", Review Comment: @MaxGekk added
[GitHub] [spark] dwsmith1983 commented on pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
dwsmith1983 commented on PR #38510: URL: https://github.com/apache/spark/pull/38510#issuecomment-1304882120 @srowen the streaming test suite passed now, so it should be fine.
[GitHub] [spark] srowen commented on pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
srowen commented on PR #38510: URL: https://github.com/apache/spark/pull/38510#issuecomment-1304879526 Weird, the error is on code that isn't in the repo (now). Can you fully rebase and force-push your changes?
[GitHub] [spark] dwsmith1983 commented on pull request #38510: [MINOR][DOC] revisions for spark sql performance tuning to improve readability and grammar
dwsmith1983 commented on PR #38510: URL: https://github.com/apache/spark/pull/38510#issuecomment-1304878701 @srowen is the failure related to the bot tagging of connect? There is some connect file that was added after this tag.
[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
ljfgem commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1014877861 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: What is the benefit of doing such a modification? And by the way, I found `output` parameter in `ShowCreateViewExec(output: Seq[Attribute], desc: V2ViewDescription)` is never used? Maybe we can remove this parameter. -- This is an automated message from the Apache Git Service. 
[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
ljfgem commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r955031916 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None + + override val viewCatalogAndNamespace: Seq[String] = +view.currentCatalog +: view.currentNamespace.toSeq + + override val viewQueryColumnNames: Seq[String] = view.schema.fieldNames Review Comment: I think we need to check if `view.schema` is null or not first? Maybe change the schema type to `Option[StructType]` -- This is an automated message from the Apache Git Service. 
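The null-safety concern in the review above could be sketched like this — a Python analogue of the suggested `Option[StructType]` change, where a missing schema yields no column names instead of a failure. The function name and the representation of the schema (just a sequence of field names) are invented for illustration:

```python
def view_query_column_names(field_names):
    # A None schema (catalog returned nothing) produces an empty column
    # list rather than raising when field names are accessed.
    if field_names is None:
        return []
    return list(field_names)
```

This mirrors wrapping the nullable value in `Option` and mapping over it on the Scala side.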
[GitHub] [spark] MaxGekk commented on pull request #38519: [MINOR][SQL] Remove unused an error class and query error methods
MaxGekk commented on PR #38519: URL: https://github.com/apache/spark/pull/38519#issuecomment-1304866695 @panbingkun @LuciferYang @itholic @cloud-fan @srielau Could you review this PR, please.
[GitHub] [spark] MaxGekk commented on a diff in pull request #38490: [SPARK-41009][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1070` to `LOCATION_ALREADY_EXISTS`
MaxGekk commented on code in PR #38490: URL: https://github.com/apache/spark/pull/38490#discussion_r1014873100 ## core/src/main/resources/error/error-classes.json: ## @@ -668,6 +668,24 @@ } } }, + "LOCATION_ALREADY_EXISTS" : { +"message" : [ + "Cannot create the location because it already exists.", + "Choose a different path or remove the existing location." Review Comment:
> I.e. in the most trivial solution just move

ok. I'll move it.
[GitHub] [spark] MaxGekk commented on a diff in pull request #38521: [SPARK-41020][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1019` to `STAR_GROUP_BY_POS`
MaxGekk commented on code in PR #38521: URL: https://github.com/apache/spark/pull/38521#discussion_r1014871999 ## core/src/main/resources/error/error-classes.json: ## @@ -912,6 +912,11 @@ ], "sqlState" : "22023" }, + "STAR_GROUP_BY_POS" : { Review Comment: It is better to ask the author of the restriction, cc @gatorsmile and the committer @cloud-fan https://github.com/apache/spark/pull/11846/files#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R389-R392
[GitHub] [spark] MaxGekk closed pull request #38522: [SPARK-41022][SQL][TESTS] Test the error class: DEFAULT_DATABASE_NOT_EXISTS, INDEX_ALREADY_EXISTS, INDEX_NOT_FOUND, ROUTINE_NOT_FOUND
MaxGekk closed pull request #38522: [SPARK-41022][SQL][TESTS] Test the error class: DEFAULT_DATABASE_NOT_EXISTS, INDEX_ALREADY_EXISTS, INDEX_NOT_FOUND, ROUTINE_NOT_FOUND URL: https://github.com/apache/spark/pull/38522
[GitHub] [spark] MaxGekk commented on pull request #38522: [SPARK-41022][SQL][TESTS] Test the error class: DEFAULT_DATABASE_NOT_EXISTS, INDEX_ALREADY_EXISTS, INDEX_NOT_FOUND, ROUTINE_NOT_FOUND
MaxGekk commented on PR #38522: URL: https://github.com/apache/spark/pull/38522#issuecomment-1304859631 +1, LGTM. Merging to master. Thank you, @panbingkun.
[GitHub] [spark] srielau commented on a diff in pull request #38521: [SPARK-41020][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1019` to `STAR_GROUP_BY_POS`
srielau commented on code in PR #38521: URL: https://github.com/apache/spark/pull/38521#discussion_r1014863229 ## core/src/main/resources/error/error-classes.json: ## @@ -912,6 +912,11 @@ ], "sqlState" : "22023" }, + "STAR_GROUP_BY_POS" : { Review Comment: LGTM ... but why do we have this restriction?
[GitHub] [spark] srielau commented on a diff in pull request #38490: [SPARK-41009][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1070` to `LOCATION_ALREADY_EXISTS`
srielau commented on code in PR #38490: URL: https://github.com/apache/spark/pull/38490#discussion_r1014862790 ## core/src/main/resources/error/error-classes.json: ## @@ -668,6 +668,24 @@ } } }, + "LOCATION_ALREADY_EXISTS" : { +"message" : [ + "Cannot create the location because it already exists.", + "Choose a different path or remove the existing location." Review Comment: I didn't mean to remove the advice. I meant to collect all the "facts" up front and then collect all the "advice" at the end — i.e., in the most trivial solution, just move "Choose a different path or remove the existing location." into the sub-error message. Anyway, these are all nits...
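The suggestion above — keep the fact in the main message and move the advice into a sub-error message — could take roughly the following shape in `error-classes.json`. This is a hypothetical sketch only: the `WITH_SUGGESTION` sub-class name is invented, and the exact schema should be checked against the file's documented conventions.

```json
{
  "LOCATION_ALREADY_EXISTS" : {
    "message" : [
      "Cannot create the location because it already exists."
    ],
    "subClass" : {
      "WITH_SUGGESTION" : {
        "message" : [
          "Choose a different path or remove the existing location."
        ]
      }
    }
  }
}
```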
[GitHub] [spark] aokolnychyi commented on a diff in pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on code in PR #36304: URL: https://github.com/apache/spark/pull/36304#discussion_r1014855632 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.dynamicpruning + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, DynamicPruningSubquery, Expression, PredicateHelper, V2ExpressionUtils} +import org.apache.spark.sql.catalyst.expressions.Literal.TrueLiteral +import org.apache.spark.sql.catalyst.planning.GroupBasedRowLevelOperation +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering +import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Implicits, DataSourceV2Relation, DataSourceV2ScanRelation} + +/** + * A rule that assigns a subquery to filter groups in row-level operations at runtime. + * + * Data skipping during job planning for row-level operations is limited to expressions that can be + * converted to data source filters. 
Since not all expressions can be pushed down that way and + * rewriting groups is expensive, Spark allows data sources to filter group at runtime. + * If the primary scan in a group-based row-level operation supports runtime filtering, this rule + * will inject a subquery to find all rows that match the condition so that data sources know + * exactly which groups must be rewritten. + * + * Note this rule only applies to group-based row-level operations. + */ +case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[LogicalPlan]) Review Comment: An alternative idea could be to move `OptimizeSubqueries` into its own file. However, that's tricky too as it calls the optimizer. ``` Optimizer.this.execute(Subquery.fromExpression(s)) ```
[GitHub] [spark] aokolnychyi commented on pull request #38526: [SPARK-38959][SQL][FOLLOW-UP] Address feedback for RowLevelOperationRuntimeGroupFiltering
aokolnychyi commented on PR #38526: URL: https://github.com/apache/spark/pull/38526#issuecomment-1304834781 @cloud-fan, could you take a look to see if that's what you meant in PR #36304? There is also one open question [here](https://github.com/apache/spark/pull/36304/files#r999648945).
[GitHub] [spark] aokolnychyi opened a new pull request, #38526: [SPARK-38959][SQL][FOLLOW-UP] Address feedback for RowLevelOperationRuntimeGroupFiltering
aokolnychyi opened a new pull request, #38526: URL: https://github.com/apache/spark/pull/38526

### What changes were proposed in this pull request?
This PR is to address the feedback on PR #36304 after that change was merged.

### Why are the changes needed?
These changes are needed for better code quality.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on code in PR #36304: URL: https://github.com/apache/spark/pull/36304#discussion_r1014850975 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.dynamicpruning + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, DynamicPruningSubquery, Expression, PredicateHelper, V2ExpressionUtils} +import org.apache.spark.sql.catalyst.expressions.Literal.TrueLiteral +import org.apache.spark.sql.catalyst.planning.GroupBasedRowLevelOperation +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering +import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Implicits, DataSourceV2Relation, DataSourceV2ScanRelation} + +/** + * A rule that assigns a subquery to filter groups in row-level operations at runtime. + * + * Data skipping during job planning for row-level operations is limited to expressions that can be + * converted to data source filters. 
Since not all expressions can be pushed down that way and + * rewriting groups is expensive, Spark allows data sources to filter group at runtime. + * If the primary scan in a group-based row-level operation supports runtime filtering, this rule + * will inject a subquery to find all rows that match the condition so that data sources know + * exactly which groups must be rewritten. + * + * Note this rule only applies to group-based row-level operations. + */ +case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[LogicalPlan]) Review Comment: Like I said [here](https://github.com/apache/spark/pull/36304#discussion_r1014850793), we could move the new rule into a separate batch and add `OptimizeSubqueries` to it.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on code in PR #36304: URL: https://github.com/apache/spark/pull/36304#discussion_r1014850793 ## sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala: ## @@ -50,7 +50,8 @@ class SparkOptimizer( override def defaultBatches: Seq[Batch] = (preOptimizationBatches ++ super.defaultBatches :+ Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+ Batch("PartitionPruning", Once, - PartitionPruning) :+ + PartitionPruning, + RowLevelOperationRuntimeGroupFiltering(OptimizeSubqueries)) :+ Review Comment: This would be much cleaner but SPARK-36444 removed `OptimizeSubqueries`. I can add a separate batch for group filtering in row-level operations. How does that sound, @cloud-fan?
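For readers outside the optimizer codebase, the "batch" being discussed is just a named group of rules applied together, either once or to a fixed point. A toy model (illustrative only, not the Catalyst implementation) of that mechanic:

```python
# Toy model of an optimizer batch: apply each rule in order, either
# once or repeatedly until the plan stops changing.

def run_batch(plan, rules, once=True):
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if once or new_plan == plan:
            return new_plan
        plan = new_plan

# A toy "plan" as a list of operators, with two toy rewrite rules.
drop_noops = lambda plan: [op for op in plan if op != "noop"]
fuse_maps = lambda plan: ["fused" if op == "map+map" else op for op in plan]

plan = ["scan", "noop", "map+map", "noop"]
optimized = run_batch(plan, [drop_noops, fuse_maps], once=True)
```

Moving a rule into its own batch, as proposed above, changes when it runs relative to other rules, which matters when a rule (like this one) produces subqueries that other rules still need to optimize.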
[GitHub] [spark] aokolnychyi commented on a diff in pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on code in PR #36304: URL: https://github.com/apache/spark/pull/36304#discussion_r1014850614 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.dynamicpruning + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, DynamicPruningSubquery, Expression, PredicateHelper, V2ExpressionUtils} +import org.apache.spark.sql.catalyst.expressions.Literal.TrueLiteral +import org.apache.spark.sql.catalyst.planning.GroupBasedRowLevelOperation +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering +import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Implicits, DataSourceV2Relation, DataSourceV2ScanRelation} + +/** + * A rule that assigns a subquery to filter groups in row-level operations at runtime. + * + * Data skipping during job planning for row-level operations is limited to expressions that can be + * converted to data source filters. 
Since not all expressions can be pushed down that way and + * rewriting groups is expensive, Spark allows data sources to filter group at runtime. + * If the primary scan in a group-based row-level operation supports runtime filtering, this rule + * will inject a subquery to find all rows that match the condition so that data sources know + * exactly which groups must be rewritten. + * + * Note this rule only applies to group-based row-level operations. + */ +case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[LogicalPlan]) Review Comment: @sunchao is correct. It wasn't easy to call `OptimizeSubqueries` outside `Optimizer`. Hence, I had to come up with this workaround. @cloud-fan, I also considered simply adding `OptimizeSubqueries` to the batch with runtime partition filtering. However, SPARK-36444 specifically removed it from there.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on code in PR #36304: URL: https://github.com/apache/spark/pull/36304#discussion_r1014850115 ## sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.dynamicpruning + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, DynamicPruningSubquery, Expression, PredicateHelper, V2ExpressionUtils} +import org.apache.spark.sql.catalyst.expressions.Literal.TrueLiteral +import org.apache.spark.sql.catalyst.planning.GroupBasedRowLevelOperation +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering +import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Implicits, DataSourceV2Relation, DataSourceV2ScanRelation} + +/** + * A rule that assigns a subquery to filter groups in row-level operations at runtime. + * + * Data skipping during job planning for row-level operations is limited to expressions that can be + * converted to data source filters. 
Since not all expressions can be pushed down that way and + * rewriting groups is expensive, Spark allows data sources to filter group at runtime. + * If the primary scan in a group-based row-level operation supports runtime filtering, this rule + * will inject a subquery to find all rows that match the condition so that data sources know + * exactly which groups must be rewritten. + * + * Note this rule only applies to group-based row-level operations. + */ +case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[LogicalPlan]) + extends Rule[LogicalPlan] with PredicateHelper { + + import DataSourceV2Implicits._ + + override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { +// apply special dynamic filtering only for group-based row-level operations +case GroupBasedRowLevelOperation(replaceData, cond, +DataSourceV2ScanRelation(_, scan: SupportsRuntimeV2Filtering, _, _, _)) +if conf.runtimeRowLevelOperationGroupFilterEnabled && cond != TrueLiteral => + + // use reference equality on scan to find required scan relations + val newQuery = replaceData.query transformUp { +case r: DataSourceV2ScanRelation if r.scan eq scan => + // use the original table instance that was loaded for this row-level operation + // in order to leverage a regular batch scan in the group filter query + val originalTable = r.relation.table.asRowLevelOperationTable.table + val relation = r.relation.copy(table = originalTable) + val matchingRowsPlan = buildMatchingRowsPlan(relation, cond) + + val filterAttrs = scan.filterAttributes + val buildKeys = V2ExpressionUtils.resolveRefs[Attribute](filterAttrs, matchingRowsPlan) + val pruningKeys = V2ExpressionUtils.resolveRefs[Attribute](filterAttrs, r) + val dynamicPruningCond = buildDynamicPruningCond(matchingRowsPlan, buildKeys, pruningKeys) + + Filter(dynamicPruningCond, r) + } + + // optimize subqueries to rewrite them as joins and trigger job planning + replaceData.copy(query = optimizeSubqueries(newQuery)) Review 
Comment: Not really, @cloud-fan. This rule simply attaches a runtime filter to the plan that was created while rewriting the delete. We do replace the query but it is pretty much the same plan, just with an extra runtime filter.
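Mechanically, the injected runtime filter behaves like an IN-subquery: a build-side query finds the rows matching the row-level condition, and their keys prune the probe-side scan. A plain-Python sketch of that semantics (not Spark code; all names are illustrative):

```python
# Sketch of dynamic pruning semantics: the "matching rows" subquery
# produces a set of keys, which then filters the main scan.

def matching_keys(table, condition, key):
    # Build side: keys of rows satisfying the row-level condition.
    return {key(r) for r in table if condition(r)}

def pruned_scan(table, keys, key):
    # Probe side: scan only rows whose key survived the subquery.
    return [r for r in table if key(r) in keys]

table = [{"id": 1, "v": 10}, {"id": 2, "v": 99}, {"id": 3, "v": 7}]
keys = matching_keys(table, lambda r: r["v"] > 50, key=lambda r: r["id"])
matched = pruned_scan(table, keys, key=lambda r: r["id"])
```

This is why the comment says the plan is "pretty much the same": the original rewrite plan is untouched except for the extra `Filter(dynamicPruningCond, r)` attached to the scan relation.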
[GitHub] [spark] srowen commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
srowen commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304827262 Looks OK if tests pass. I think we need to update the two together, so this resolves both JIRAs
[GitHub] [spark] bsikander commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
bsikander commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1304825343 @sunchao thank you for your efforts. When can we expect the release of 3.2.3?
[GitHub] [spark] srowen commented on pull request #38499: [MINOR][DOC] updated some grammar and a missed period in the tuning doc
srowen commented on PR #38499: URL: https://github.com/apache/spark/pull/38499#issuecomment-1304822113 Merged to master
[GitHub] [spark] srowen closed pull request #38499: [MINOR][DOC] updated some grammar and a missed period in the tuning doc
srowen closed pull request #38499: [MINOR][DOC] updated some grammar and a missed period in the tuning doc URL: https://github.com/apache/spark/pull/38499
[GitHub] [spark] srowen commented on pull request #38525: [SPARK-40950][BUILD][FOLLOWUP] Fix Scala 2.13 Mima check
srowen commented on PR #38525: URL: https://github.com/apache/spark/pull/38525#issuecomment-1304811696 Merged to master
[GitHub] [spark] srowen closed pull request #38525: [SPARK-40950][BUILD][FOLLOWUP] Fix Scala 2.13 Mima check
srowen closed pull request #38525: [SPARK-40950][BUILD][FOLLOWUP] Fix Scala 2.13 Mima check URL: https://github.com/apache/spark/pull/38525
[GitHub] [spark] bjornjorgensen commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
bjornjorgensen commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304801289 @pjfanning Yes, but I just wanna ask @LuciferYang if he has been looking at it.
[GitHub] [spark] pjfanning commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
pjfanning commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304800274 Shouldn't guava changes be done in a separate PR?
[GitHub] [spark] bjornjorgensen commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
bjornjorgensen commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304798904 @LuciferYang Have you tried to upgrade guava to 31.1-jre?
[GitHub] [spark] pjfanning commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
pjfanning commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304798867 @LuciferYang feel free to continue with your PR. I marked SPARK-41023 as a duplicate of SPARK-40911.
[GitHub] [spark] bjornjorgensen commented on pull request #38523: [SPARK-41023][BUILD] Upgrade Jackson to 2.14.0
bjornjorgensen commented on PR #38523: URL: https://github.com/apache/spark/pull/38523#issuecomment-1304797597 https://issues.apache.org/jira/browse/SPARK-40911 I think that @pjfanning will work on this one.
[GitHub] [spark] panbingkun commented on a diff in pull request #38520: [SPARK-41021][SQL][TESTS] Test some subclasses of error class DATATYPE_MISMATCH
panbingkun commented on code in PR #38520: URL: https://github.com/apache/spark/pull/38520#discussion_r1014823662 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala: ## @@ -581,6 +591,19 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with ) } + test("to_json - struct: unable to convert column of ObjectType to JSON") { +val schema = StructType(StructField("a", ObjectType(classOf[java.lang.Integer])) :: Nil) +val structData = Literal.create(create_row(Integer.valueOf(1)), schema) +assert(StructsToJson(Map.empty, structData).checkInputDataTypes() == + DataTypeMismatch( +errorSubClass = "CANNOT_CONVERT_TO_JSON", Review Comment: Another UT [JsonFunctionsSuite.scala](https://github.com/apache/spark/pull/38520/files#diff-f406c940ba1e8f95cc1d87500cc0906fff593fbc336ba66d04daf784121a6050) (screenshot: https://user-images.githubusercontent.com/15246973/200171460-65c1e2fc-0018-438e-84ce-7382b91ec97d.png)
[GitHub] [spark] panbingkun commented on a diff in pull request #38520: [SPARK-41021][SQL][TESTS] Test some subclasses of error class DATATYPE_MISMATCH
panbingkun commented on code in PR #38520: URL: https://github.com/apache/spark/pull/38520#discussion_r1014823446 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala: ## @@ -406,6 +407,15 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with InternalRow(UTF8String.fromString("1"), null, UTF8String.fromString("1"))) } + test("json_tuple - all arguments must be strings") { +assert(JsonTuple(Seq(Literal(888), Literal(999))).checkInputDataTypes() == + DataTypeMismatch( +errorSubClass = "NON_STRING_TYPE", Review Comment: Another UT [JsonFunctionsSuite.scala](https://github.com/apache/spark/pull/38520/files#diff-f406c940ba1e8f95cc1d87500cc0906fff593fbc336ba66d04daf784121a6050) (screenshot: https://user-images.githubusercontent.com/15246973/200171401-b88aa6ab-8941-4f2c-86b8-68fd7e81c2d4.png)
[GitHub] [spark] panbingkun commented on a diff in pull request #38520: [SPARK-41021][SQL][TESTS] Test some subclasses of error class DATATYPE_MISMATCH
panbingkun commented on code in PR #38520: URL: https://github.com/apache/spark/pull/38520#discussion_r1014823040 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala: ## @@ -346,6 +358,18 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper assert(errorSubClass === "INVALID_MAP_KEY_TYPE") assert(messageParameters === Map("keyType" -> "\"MAP\"")) } + +// accepts only arrays of pair structs +val mapWrongType = MapFromEntries(Literal(1)) +assert(mapWrongType.checkInputDataTypes() == + DataTypeMismatch( +errorSubClass = "MAP_FROM_ENTRIES_WRONG_TYPE", +messageParameters = Map( Review Comment: Another UT [DataFrameFunctionsSuite.scala](https://github.com/apache/spark/pull/38520/files#diff-03f5f4bb4f15c5640fae904891b75a1d21f7bf8fa92c4abcb9b35d1500c37ae7) (screenshot: https://user-images.githubusercontent.com/15246973/200171295-aebe7755-a31e-4c8d-92b1-146dbe3ef437.png)
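The tests above assert that `checkInputDataTypes()` returns a structured `DataTypeMismatch` instead of an ad-hoc exception. A standalone imitation of that pattern (the function and error shape below mimic, but are not, Spark's implementation) for the `json_tuple` case, where every argument must be a string type:

```python
# Hedged sketch: a type check that returns a structured error object
# (error subclass + message parameters) rather than raising.

def check_json_tuple_args(arg_types):
    """All json_tuple arguments must be strings; otherwise report a
    NON_STRING_TYPE mismatch, mirroring the test expectations above."""
    non_string = [t for t in arg_types if t != "string"]
    if non_string:
        return {"errorSubClass": "NON_STRING_TYPE",
                "messageParameters": {"funcName": "json_tuple"}}
    return {"errorSubClass": None, "messageParameters": {}}
```

Structured errors like this are what make the error-class framework testable: the suite can compare the whole object for equality instead of regex-matching a message string.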
[GitHub] [spark] HyukjinKwon closed pull request #38524: [SPARK-41024][BUILD] Upgrade scala-maven-plugin to 4.7.2
HyukjinKwon closed pull request #38524: [SPARK-41024][BUILD] Upgrade scala-maven-plugin to 4.7.2 URL: https://github.com/apache/spark/pull/38524
[GitHub] [spark] HyukjinKwon commented on pull request #38524: [SPARK-41024][BUILD] Upgrade scala-maven-plugin to 4.7.2
HyukjinKwon commented on PR #38524: URL: https://github.com/apache/spark/pull/38524#issuecomment-1304783719 Merged to master.
[GitHub] [spark] mridulm commented on pull request #38525: [SPARK-40950][BUILD][FOLLOWUP] Fix Scala 2.13 Mima check
mridulm commented on PR #38525: URL: https://github.com/apache/spark/pull/38525#issuecomment-1304775260 Will wait for tests to pass ...
[GitHub] [spark] LuciferYang commented on pull request #38491: [MINOR][CONNECT] Remove unused import in commands.proto
LuciferYang commented on PR #38491: URL: https://github.com/apache/spark/pull/38491#issuecomment-1304764271 looks fine
[GitHub] [spark] LuciferYang commented on pull request #38525: [SPARK-40950][BUILD][FOLLOWUP] Fix Scala 2.13 Mima check
LuciferYang commented on PR #38525: URL: https://github.com/apache/spark/pull/38525#issuecomment-1304762343 cc @dongjoon-hyun @srowen @mridulm @HyukjinKwon