[GitHub] [spark] huaxingao commented on a diff in pull request #37746: [SPARK-40293][SQL] Make the V2 table error message more meaningful

2022-09-01 Thread GitBox


huaxingao commented on code in PR #37746:
URL: https://github.com/apache/spark/pull/37746#discussion_r961313022


##
core/src/main/resources/error/error-classes.json:
##
@@ -520,6 +520,11 @@
   "NATURAL CROSS JOIN."
 ]
   },
+  "OPERATION_ONLY_SUPPORTED_WITH_V2_TABLE" : {
+"message" : [
+  "Table `spark_catalog`.`default`.`t1` does not support . 
Please check the current catalog and namespace to make sure the qualified table 
name is expected, and also check the catalog implementation which is configured 
by 'spark.sql.catalog.spark_catalog'."

Review Comment:
   Fixed. Thanks!
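For context, Spark's error classes are defined in error-classes.json with angle-bracket placeholders that get substituted at error-construction time; the point of this review thread is that the message above had concrete values baked in instead of placeholders. A parameterized entry would look roughly like the following — the class name and placeholder names here are illustrative only, not necessarily what the PR finally merged:

```json
{
  "UNSUPPORTED_TABLE_OPERATION": {
    "message": [
      "Table <tableName> does not support <operation>. Please check the current catalog and namespace to make sure the qualified table name is expected, and also check the catalog implementation which is configured by <config>."
    ]
  }
}
```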



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] huaxingao commented on a diff in pull request #37746: [SPARK-40293][SQL] Make the V2 table error message more meaningful

2022-09-01 Thread GitBox


huaxingao commented on code in PR #37746:
URL: https://github.com/apache/spark/pull/37746#discussion_r961312863


##
core/src/main/resources/error/error-classes.json:
##
@@ -520,6 +520,11 @@
   "NATURAL CROSS JOIN."
 ]
   },
+  "OPERATION_ONLY_SUPPORTED_WITH_V2_TABLE" : {

Review Comment:
   Sounds good. Changed.








[GitHub] [spark] zhengruifeng opened a new pull request, #37767: [SPARK-39284][FOLLOW] Add Groupby.mad to API references

2022-09-01 Thread GitBox


zhengruifeng opened a new pull request, #37767:
URL: https://github.com/apache/spark/pull/37767

   ### What changes were proposed in this pull request?
   Add `Groupby.mad` to API references
   
   
   ### Why are the changes needed?
   `Groupby.mad` was implemented in https://github.com/apache/spark/pull/36660, 
but I forgot to add it to the doc
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, this API will be listed in the references
   
   
   ### How was this patch tested?
   existing doc building
   





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


zhengruifeng commented on code in PR #37756:
URL: https://github.com/apache/spark/pull/37756#discussion_r961280658


##
python/pyspark/pandas/groupby.py:
##
@@ -827,6 +827,76 @@ def mad(self) -> FrameLike:
 
 return self._prepare_return(DataFrame(internal))
 
+    def sem(self, ddof: int = 1) -> FrameLike:
+        """
+        Compute standard error of the mean of groups, excluding missing values.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        ddof : int, default 1
+            Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
+            where N represents the number of elements.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({"A": [1, 2, 1, 1], "B": [True, False, False, True],
+        ...                    "C": [3, None, 3, 4], "D": ["a", "b", "b", "a"]})
+
+        >>> df.groupby("A").sem()
+                  B         C
+        A
+        1  0.333333  0.333333
+        2       NaN       NaN
+
+        >>> df.groupby("D").sem(ddof=1)
+             A    B    C
+        D
+        a  0.0  0.0  0.5
+        b  0.5  0.0  NaN
+
+        >>> df.B.groupby(df.A).sem()
+        A
+        1    0.333333
+        2         NaN
+        Name: B, dtype: float64
+
+        See Also
+        --------
+        pyspark.pandas.Series.sem
+        pyspark.pandas.DataFrame.sem
+        """
+        if ddof not in [0, 1]:
+            raise TypeError("ddof must be 0 or 1")
+
+        # Raise the TypeError when all aggregation columns are of unaccepted data types
+        all_unaccepted = True
+        for _agg_col in self._agg_columns:
+            if isinstance(_agg_col.spark.data_type, (NumericType, BooleanType)):
+                all_unaccepted = False
+                break
+        if all_unaccepted:

Review Comment:
   nice, this is more concise. I just copied it from other places






[GitHub] [spark] hgs19921112 commented on pull request #37766: [SPARK-40288][SQL]After RemoveRedundantAggregates, PullOutGroupingExpressions should applied to avoid attribute missing when use complex

2022-09-01 Thread GitBox


hgs19921112 commented on PR #37766:
URL: https://github.com/apache/spark/pull/37766#issuecomment-1235057302

   cc @dongjoon-hyun 





[GitHub] [spark] hgs19921112 opened a new pull request, #37766: [SPARK-40288][SQL]After RemoveRedundantAggregates, PullOutGroupingExpressions should applied to avoid attribute missing when use complex

2022-09-01 Thread GitBox


hgs19921112 opened a new pull request, #37766:
URL: https://github.com/apache/spark/pull/37766

   
   
   ### What changes were proposed in this pull request?
   
   RemoveRedundantAggregates can cause a reference attribute to go missing when a complex expression is used in GROUP BY.
   
   ### Why are the changes needed?
   
   This bug will cause aggregate failure.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   ### How was this patch tested?
   
   unit test.





[GitHub] [spark] hgs19921112 closed pull request #37765: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex e

2022-09-01 Thread GitBox


hgs19921112 closed pull request #37765: [SPARK-40288][SQL]After 
`RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to 
avoid attribute missing when use  complex expression.
URL: https://github.com/apache/spark/pull/37765





[GitHub] [spark] hgs19921112 commented on pull request #37765: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use comp

2022-09-01 Thread GitBox


hgs19921112 commented on PR #37765:
URL: https://github.com/apache/spark/pull/37765#issuecomment-1235054014

   @github-actions





[GitHub] [spark] xinrong-meng commented on a diff in pull request #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


xinrong-meng commented on code in PR #37756:
URL: https://github.com/apache/spark/pull/37756#discussion_r961275433


##
python/pyspark/pandas/tests/test_groupby.py:
##
@@ -3055,6 +3065,17 @@ def test_ddof(self):
 psdf.groupby("a")["b"].var(ddof=ddof).sort_index(),
 check_exact=False,
 )
+# var

Review Comment:
   nit: `# sem`






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


xinrong-meng commented on code in PR #37756:
URL: https://github.com/apache/spark/pull/37756#discussion_r961275185


##
python/pyspark/pandas/generic.py:
##
@@ -2189,7 +2189,7 @@ def std(psser: "Series") -> Column:
 return F.stddev_samp(spark_column)
 
 def sem(psser: "Series") -> Column:
-return std(psser) / pow(Frame._count_expr(psser), 0.5)
+return std(psser) / F.sqrt(Frame._count_expr(psser))

Review Comment:
   Nice!
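The change swaps `pow(n, 0.5)` for `F.sqrt(n)`; either way, the statistic computed is the sample standard deviation divided by the square root of the count. A plain-Python sketch of that formula (illustrative only, not the pyspark.pandas implementation):

```python
import math
import statistics

def sem(values, ddof=1):
    """Standard error of the mean: std (with the given ddof) / sqrt(N),
    excluding missing values -- mirroring std(psser) / F.sqrt(count)."""
    vals = [v for v in values if v is not None]  # drop missing values
    n = len(vals)
    if n - ddof <= 0:  # not enough observations for this ddof
        return float("nan")
    std = statistics.stdev(vals) if ddof == 1 else statistics.pstdev(vals)
    return std / math.sqrt(n)
```

For the docstring's group A=1 column C values [3, 3, 4], this gives 1/3 ≈ 0.333333, matching the doctest output.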






[GitHub] [spark] xinrong-meng commented on a diff in pull request #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


xinrong-meng commented on code in PR #37756:
URL: https://github.com/apache/spark/pull/37756#discussion_r961271308


##
python/pyspark/pandas/groupby.py:
##
@@ -827,6 +827,76 @@ def mad(self) -> FrameLike:
 
 return self._prepare_return(DataFrame(internal))
 
+    def sem(self, ddof: int = 1) -> FrameLike:
+        """
+        Compute standard error of the mean of groups, excluding missing values.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        ddof : int, default 1
+            Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
+            where N represents the number of elements.
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({"A": [1, 2, 1, 1], "B": [True, False, False, True],
+        ...                    "C": [3, None, 3, 4], "D": ["a", "b", "b", "a"]})
+
+        >>> df.groupby("A").sem()
+                  B         C
+        A
+        1  0.333333  0.333333
+        2       NaN       NaN
+
+        >>> df.groupby("D").sem(ddof=1)
+             A    B    C
+        D
+        a  0.0  0.0  0.5
+        b  0.5  0.0  NaN
+
+        >>> df.B.groupby(df.A).sem()
+        A
+        1    0.333333
+        2         NaN
+        Name: B, dtype: float64
+
+        See Also
+        --------
+        pyspark.pandas.Series.sem
+        pyspark.pandas.DataFrame.sem
+        """
+        if ddof not in [0, 1]:
+            raise TypeError("ddof must be 0 or 1")
+
+        # Raise the TypeError when all aggregation columns are of unaccepted data types
+        all_unaccepted = True
+        for _agg_col in self._agg_columns:
+            if isinstance(_agg_col.spark.data_type, (NumericType, BooleanType)):
+                all_unaccepted = False
+                break
+        if all_unaccepted:

Review Comment:
   nit: 
   ```
   any_accepted = any(
       isinstance(_agg_col.spark.data_type, (NumericType, BooleanType))
       for _agg_col in self._agg_columns
   )
   if not any_accepted:
       raise TypeError("...")
   ```
   ?
   Feel free to keep the existing approach if you prefer :)
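The `any()` form short-circuits on the first accepted column, exactly like the original `for`/`break` loop. A self-contained sketch of the pattern, using plain Python types as stand-ins for Spark's `NumericType`/`BooleanType` checks:

```python
def has_accepted_column(column_types, accepted=(int, float, bool)):
    # any() stops iterating at the first match, just like a for/break loop
    return any(issubclass(t, accepted) for t in column_types)

# When every aggregation column is of an unaccepted type,
# the caller would raise TypeError.
if not has_accepted_column([str, bytes]):
    print("TypeError: Unaccepted data types of aggregation columns.")
```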






[GitHub] [spark] hgs19921112 opened a new pull request, #37765: [SPARK-40288][SQL]After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use com

2022-09-01 Thread GitBox


hgs19921112 opened a new pull request, #37765:
URL: https://github.com/apache/spark/pull/37765

   
   
   ### What changes were proposed in this pull request?
   
   RemoveRedundantAggregates can cause a reference attribute to go missing when a complex expression is used in the GROUP BY clause.
   
   ### Why are the changes needed?
   
   Reproduce the bug:
   -- table
   create table miss_expr(id int, name string, age double) stored as textfile
   -- data
   insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
   -- failure sql
   insert overwrite table miss_expr
   select id, name, nage as n from (
     select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
   ) group by id, name, nage
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No.
   ### How was this patch tested?
   
   unit test.





[GitHub] [spark] zhengruifeng commented on pull request #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


zhengruifeng commented on PR #37756:
URL: https://github.com/apache/spark/pull/37756#issuecomment-1235010056

   cc @itholic @xinrong-meng @HyukjinKwon 





[GitHub] [spark] aokolnychyi commented on a diff in pull request #37749: [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering

2022-09-01 Thread GitBox


aokolnychyi commented on code in PR #37749:
URL: https://github.com/apache/spark/pull/37749#discussion_r961236436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala:
##
@@ -105,18 +105,27 @@ object V2ExpressionUtils extends SQLConfHelper with Logging {
   TransformExpression(bound, resolvedRefs, Some(numBuckets))
 }
   }
-case NamedTransform(name, refs)
-if refs.length == 1 && refs.forall(_.isInstanceOf[NamedReference]) =>

Review Comment:
   On the other hand, skipping `KeyGroupedPartitioning` means we will never 
create `KeyGroupedShuffleSpec`, which has that validation logic. Maybe, that's 
a safer option.






[GitHub] [spark] panbingkun commented on a diff in pull request #37700: [SPARK-40251][BUILD][MLLIB] Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 & breeze from 2.0 to 2.1.0

2022-09-01 Thread GitBox


panbingkun commented on code in PR #37700:
URL: https://github.com/apache/spark/pull/37700#discussion_r961209155


##
mllib-local/pom.xml:
##
@@ -61,6 +61,11 @@
   org.apache.spark
   spark-tags_${scala.binary.version}
 
+

Review Comment:
   ok, I will replace it with Class.forName
   It works well.






[GitHub] [spark] aokolnychyi commented on a diff in pull request #37749: [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering

2022-09-01 Thread GitBox


aokolnychyi commented on code in PR #37749:
URL: https://github.com/apache/spark/pull/37749#discussion_r961211655


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala:
##
@@ -105,18 +105,27 @@ object V2ExpressionUtils extends SQLConfHelper with Logging {
   TransformExpression(bound, resolvedRefs, Some(numBuckets))
 }
   }
-case NamedTransform(name, refs)
-if refs.length == 1 && refs.forall(_.isInstanceOf[NamedReference]) =>

Review Comment:
   I added tests but they seem to work only by accident as `satisfies0` in 
`KeyGroupedPartitioning` returns false and triggers a shuffle. We probably need 
to adapt the existing logic to be more reliable and explicit.
   
   I see two solutions:
   - Explicitly check we have flat transforms with one child ref in 
`satisfies0` in `KeyGroupedPartitioning`.
   - Don't construct `KeyGroupedPartitioning` unless we have flat transforms 
with one child ref.
   
   I am inclined to go with the first option but let me know if you have a 
better idea.









[GitHub] [spark] lyssg commented on pull request #35667: [SPARK-38425][K8S] Avoid possible errors due to incorrect file size or type supplied in hadoop conf

2022-09-01 Thread GitBox


lyssg commented on PR #35667:
URL: https://github.com/apache/spark/pull/35667#issuecomment-1234955160

   @dongjoon-hyun, @martin-g, @ScrapCodes, I have rebased my PR again. Could you take another look?





[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


leewyang commented on code in PR #37734:
URL: https://github.com/apache/spark/pull/37734#discussion_r960972099


##
python/pyspark/ml/executor_globals.py:
##
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Module to hold globals for python processes on executors
+from typing import Callable, Optional
+from uuid import UUID
+
+
+model_uuid: Optional[UUID] = None
+predict_fn: Optional[Callable] = None

Review Comment:
   Right, updated the code.  ~~Probably still need to figure out a scheme for 
cache invalidation for large models and/or frequent use.~~ Update: looks like 
the python idle worker cleanup is probably sufficient here.






[GitHub] [spark] gengliangwang opened a new pull request, #37764: [SPARK-40310][SQL] try_sum() should throw the exceptions from its child

2022-09-01 Thread GitBox


gengliangwang opened a new pull request, #37764:
URL: https://github.com/apache/spark/pull/37764

   
   
   ### What changes were proposed in this pull request?
   
   Similar to https://github.com/apache/spark/pull/37486 and 
https://github.com/apache/spark/pull/37663,  this PR refactors the 
implementation of try_sum() so that the errors from its child should be shown 
instead of ignored.
   
   ### Why are the changes needed?
   
   
   Fix the semantics of try_sum() function
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, after changes, the error from the child of try_sum function will be 
shown instead of ignored.
   
   ### How was this patch tested?
   
   New UT
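   The intended semantics can be modeled in plain Python: an error raised while evaluating the child expression propagates, while overflow inside the summation itself degrades to NULL. This is an illustrative sketch, not Spark's implementation:

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def try_sum(child_values):
    """child_values are thunks: evaluating one may raise, and such child
    errors propagate; only overflow of the running 64-bit sum yields
    None (SQL NULL) instead of an error."""
    total = 0
    for thunk in child_values:
        total += thunk()  # child errors are NOT swallowed
        if not INT64_MIN <= total <= INT64_MAX:
            return None   # overflow in the aggregation itself -> NULL
    return total
```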





[GitHub] [spark] bersprockets commented on a diff in pull request #37763: [SPARK-40308][SQL] Allow non-foldable delimiter arguments to `str_to_map` function

2022-09-01 Thread GitBox


bersprockets commented on code in PR #37763:
URL: https://github.com/apache/spark/pull/37763#discussion_r961154793


##
sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:
##
@@ -606,6 +606,36 @@ class StringFunctionsSuite extends QueryTest with 
SharedSparkSession {
   Seq(Row(Map("a" -> "1", "b" -> "2", "c" -> "3")))
 )
 
+checkAnswer(

Review Comment:
   Not really a test of my change, but it was a missing test (the case where a 
literal pair delimiter is given but no key/value delimiter is given).






[GitHub] [spark] bersprockets opened a new pull request, #37763: [SPARK-40308][SQL] Allow non-foldable delimiter arguments to `str_to_map` function

2022-09-01 Thread GitBox


bersprockets opened a new pull request, #37763:
URL: https://github.com/apache/spark/pull/37763

   ### What changes were proposed in this pull request?
   
   Remove the check for foldable delimiter arguments from 
`StringToMap#checkInputDataTypes`.
   
   Except for `checkInputDataTypes`, `StringToMap` is already able to handle 
non-foldable delimiter arguments (no other changes are required).
   
   ### Why are the changes needed?
   
   Hive 2.3.9 allows non-foldable delimiter arguments, e.g.:
   ```
   drop table if exists maptbl;
   create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
   insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
str;
   select str, str_to_map(str, del1, del2) from maptbl;
   ```
   This returns
   ```
   +--------------+----------------------------+
   |     str      |            _c1             |
   +--------------+----------------------------+
   | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
   | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
   +--------------+----------------------------+
   2 rows selected (0.13 seconds)
   ```
   However, Spark returns an error:
   ```
   str_to_map's delimiters must be foldable.; line 1 pos 12;
   ```
   
   The use-case is more likely to be something like this:
   ```
   select
 str,
 str_to_map(str, ',', if(region = 0, ':', '#')) as m
   from
 maptbl2;
   
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users can now specify non-foldable delimiter arguments to `str_to_map`.
   
   Literals are still accepted, so the change is backwardly compatible.
   
   ### How was this patch tested?
   
   New unit tests.
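   A minimal Python model of `str_to_map` makes the per-row-delimiter use-case concrete — this is a sketch of the SQL function's behavior, not Spark code:

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Split text into pairs on pair_delim, then each pair into key/value
    on kv_delim. With the foldable-argument check removed, the SQL
    delimiters can be computed per row (e.g. if(region = 0, ':', '#'))."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        result[key] = value if sep else None  # no kv delimiter -> null value
    return result
```

   With this model, `str_to_map('a:1,b:2,c:3')` and `str_to_map('a-1%b-2%c-3', '%', '-')` both produce `{'a': '1', 'b': '2', 'c': '3'}`, matching the Hive output shown above.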
   





[GitHub] [spark] dongjoon-hyun commented on pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37622:
URL: https://github.com/apache/spark/pull/37622#issuecomment-1234811066

   Thank YOU, @yangwwei .





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234809633

   This PR passed CI here.
   - 
https://github.com/dongjoon-hyun/spark/runs/8139684479?check_suite_focus=true





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234807199

   Ur, wait. @steveloughran . It's Java 8, isn't it?
   
   
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/pom.xml#L76-L77
   
   ```
   8
   8
   ```





[GitHub] [spark] dongjoon-hyun closed pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun closed pull request #37745: [SPARK-33605][BUILD] Add 
`gcs-connector` to `hadoop-cloud` module
URL: https://github.com/apache/spark/pull/37745





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234803915

   Thank you for review, @steveloughran .
   
   > note that the gcs connector (at leasts the builds off their master) are 
java 11 only; not sure where that stands w.r.t older releases 
   
   I didn't realize this because I've been using Java 11+. If so, I had better close this PR and the JIRA officially.
   
   Thank you, @srowen , @sunchao and @steveloughran !





[GitHub] [spark] linhongliu-db commented on pull request #37742: [SPARK-40291][SQL] Improve the message for column not in group by clause error

2022-09-01 Thread GitBox


linhongliu-db commented on PR #37742:
URL: https://github.com/apache/spark/pull/37742#issuecomment-1234796980

   @MaxGekk Thanks for reviewing! I updated the PR





[GitHub] [spark] aokolnychyi commented on a diff in pull request #37749: [SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering

2022-09-01 Thread GitBox


aokolnychyi commented on code in PR #37749:
URL: https://github.com/apache/spark/pull/37749#discussion_r961043795


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala:
##
@@ -105,18 +105,27 @@ object V2ExpressionUtils extends SQLConfHelper with Logging {
   TransformExpression(bound, resolvedRefs, Some(numBuckets))
 }
   }
-case NamedTransform(name, refs)
-if refs.length == 1 && refs.forall(_.isInstanceOf[NamedReference]) =>

Review Comment:
   Let me check.






[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


leewyang commented on code in PR #37734:
URL: https://github.com/apache/spark/pull/37734#discussion_r960986473


##
python/pyspark/ml/functions.py:
##
@@ -106,6 +112,170 @@ def array_to_vector(col: Column) -> Column:
 return Column(sc._jvm.org.apache.spark.ml.functions.array_to_vector(_to_java_column(col)))
 
 
+def batched(df: pd.DataFrame, batch_size: int = -1) -> Iterator[pd.DataFrame]:
+"""Generator that splits a pandas dataframe/series into batches."""
+if batch_size <= 0 or batch_size >= len(df):
+yield df
+else:
+# for batch in np.array_split(df, (len(df.index) + batch_size - 1) // batch_size):
+for _, batch in df.groupby(np.arange(len(df)) // batch_size):
+yield batch
+
+
+def has_tensor_cols(df: pd.DataFrame) -> bool:
+"""Check if input DataFrame contains any tensor-valued columns"""
+if any(df.dtypes == np.object_):
+# pd.DataFrame object types can contain different types, e.g. string, dates, etc.
+# so inspect a row and check for array/list type
+sample = df.iloc[0]
+return any([isinstance(x, np.ndarray) or isinstance(x, list) for x in sample])
+else:
+return False
+
+
+def batch_infer_udf(
+predict_batch_fn: Callable,
+return_type: DataType = ArrayType(FloatType()),
+batch_size: int = -1,
+input_names: list[str] = [],
+input_tensor_shapes: list[list[int]] = [],
+**kwargs: Any,
+) -> Callable:
+"""Given a function which loads a model, returns a pandas_udf for 
inferencing over that model.
+
+This will handle:
+- conversion of the Spark DataFrame to numpy arrays.
+- batching of the inputs sent to the model predict() function.
+- caching of the model and prediction function on the executors.
+
+This assumes that the `predict_batch_fn` encapsulates all of the necessary 
dependencies for
+running the model or the Spark executor environment already satisfies all 
runtime requirements.
+
+When selecting columns in pyspark SQL, users are required to always use 
`struct` for simplicity.
+
+For the conversion of Spark DataFrame to numpy, the following table 
describes the behavior,
+where tensor columns in the Spark DataFrame must be represented as a 
flattened 1-D array/list.
+
+| dataframe \\ model | single input | multiple inputs |
+| :- | :--- | :-- |
+| single-col scalar  | 1| N/A |
+| single-col tensor  | 1,2  | N/A |
+| multi-col scalar   | 3| 4   |
+| multi-col tensor   | N/A  | 4,2 |
+
+
+Notes:
+1. pass thru dataframe column => model input as single numpy array.
+2. reshape flattened tensors into expected tensor shapes.
+3. convert entire dataframe into single numpy array via df.to_numpy(), or 
user can use
+   `pyspark.sql.functions.array()` to transform the input into a 
single-col tensor first.
+4. pass thru dataframe column => model input as an (ordered) dictionary of 
numpy arrays.
+
+Parameters
+--
+predict_batch_fn : Callable
+Function which is responsible for loading a model and returning a 
`predict` function.
+return_type : DataType
+Spark SQL datatype for the expected output.
+Default: ArrayType(FloatType())
+batch_size : int
+Batch size to use for inference, note that this is typically a 
limitation of the model
+and/or the hardware resources and is usually smaller than the Spark 
partition size.
+Default: -1, which sends the entire Spark partition to the model.
+input_names: list[str]
+Optional list of input names which will be used to map DataFrame 
column names to model
+input names.  The order of names must match the order of the selected 
DataFrame columns.
+If provided, the `predict()` function will be passed a dictionary of 
named inputs.
+input_tensor_shapes: list[list[int]]
+Optional list of input tensor shapes for models with tensor inputs.  
Each tensor
+input must be represented as a single DataFrame column containing a 
flattened 1-D array.
+The order of the tensor shapes must match the order of the selected 
DataFrame columns.
+Tabular datasets with scalar-valued columns should not supply this 
argument.
+
+Returns
+---
+A pandas_udf for predicting a batch.
+"""
+# generate a new uuid each time this is invoked on the driver to 
invalidate executor-side cache.
+model_uuid = uuid.uuid4()
+
+def predict(data: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
+import pyspark.ml.executor_globals as exec_global
+
+if exec_global.predict_fn and exec_global.model_uuid == model_uuid:
+predict_fn = exec_global.predict_fn
+else:
+predict_fn = predict_batch_fn(**kwargs)
+exec_global.predict_fn = 
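The `batched` helper quoted in the diff above (the message is truncated in this archive) splits a partition into fixed-size chunks, sending everything at once when no batch size is set. The same batching logic can be sketched in plain Python without the pandas dependency; the names here are illustrative:

```python
from typing import Iterator, Sequence

def batched(seq: Sequence, batch_size: int = -1) -> Iterator[Sequence]:
    """Yield fixed-size slices; a non-positive or oversized batch_size yields everything at once."""
    if batch_size <= 0 or batch_size >= len(seq):
        yield seq
    else:
        for start in range(0, len(seq), batch_size):
            yield seq[start:start + batch_size]
```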

[GitHub] [spark] MaxGekk commented on a diff in pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


MaxGekk commented on code in PR #37744:
URL: https://github.com/apache/spark/pull/37744#discussion_r960982999


##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala:
##
@@ -47,14 +47,23 @@ class ExpressionTypeCheckingSuite extends SparkFunSuite with SQLHelper {
 assert(e.getMessage.contains(errorMessage))
   }
 
+  def assertTypeMismatch(expr: Expression, errorMessage: String): Unit = {
+val e = intercept[AnalysisException] {
+  assertSuccess(expr)
+}
+assert(e.getMessage.contains(
+  s"""Cannot resolve "${expr.sql}" due to data type mismatch:"""))
+assert(e.getMessage.contains(errorMessage))

Review Comment:
   done



##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -2666,10 +2666,7 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
 val m = intercept[AnalysisException] {
   sql("SELECT * FROM t, S WHERE c = C")
 }.message
-assert(
-  m.contains(
-"cannot resolve '(spark_catalog.default.t.c = " +
-"spark_catalog.default.S.C)' due to data type mismatch"))
+assert(m.contains("""Cannot resolve "(c = C)" due to data type mismatch"""))

Review Comment:
   done






[GitHub] [spark] yangwwei commented on pull request #37753: [SPARK-40302][K8S][TESTS] Add `YuniKornSuite`

2022-09-01 Thread GitBox


yangwwei commented on PR #37753:
URL: https://github.com/apache/spark/pull/37753#issuecomment-1234638214

   Very nice, thank you @dongjoon-hyun !





[GitHub] [spark] yangwwei commented on pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-09-01 Thread GitBox


yangwwei commented on PR #37622:
URL: https://github.com/apache/spark/pull/37622#issuecomment-1234636614

   Hi, @dongjoon-hyun  thanks a lot for helping on this.
   This is a great community collaboration between YuniKorn and Spark, thank 
you so much!





[GitHub] [spark] bjornjorgensen opened a new pull request, #37762: SPARK-39996[BUILD] Upgrade 'postgresql' to 42.5.0

2022-09-01 Thread GitBox


bjornjorgensen opened a new pull request, #37762:
URL: https://github.com/apache/spark/pull/37762

   ### What changes were proposed in this pull request?
   Upgrade 'postgresql' 42.3.3 to 42.5.0
   
   
   ### Why are the changes needed?
   fix: 
[CVE-2022-31197](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197) 
Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so as 
to prevent SQL injection.
   Previously, the column names for both key and data columns in the table were 
copied as-is into the generated
   SQL. This allowed a malicious table with column names that include statement 
terminator to be parsed and
   executed as multiple separate commands.
   Also adds a new test class ResultSetRefreshTest to verify this change.
   Reported by [Sho Kato](https://github.com/kato-sho)
   
   
[Changelog](https://jdbc.postgresql.org/documentation/changelog.html#version_42.5.0)
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Pass GA
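The CVE fix described above boils down to quoting column identifiers before splicing them into generated SQL. A minimal sketch of standard SQL identifier quoting (doubling embedded double quotes); this is illustrative, not the actual pgjdbc implementation:

```python
def quote_identifier(name: str) -> str:
    """Quote a SQL identifier so embedded quotes or statement terminators stay inert."""
    return '"' + name.replace('"', '""') + '"'

# A malicious column name can no longer terminate the quoted identifier
# and start a second statement:
malicious = 'c"; DROP TABLE t; --'
```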





[GitHub] [spark] steveloughran commented on a diff in pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


steveloughran commented on code in PR #37745:
URL: https://github.com/apache/spark/pull/37745#discussion_r960971095


##
hadoop-cloud/pom.xml:
##
@@ -135,6 +135,18 @@
 
   
 
+    <dependency>
+      <groupId>com.google.cloud.bigdataoss</groupId>
+      <artifactId>gcs-connector</artifactId>
+      <version>${gcs-connector.version}</version>
+      <classifier>shaded</classifier>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>

Review Comment:
   issue is that there's a history of the shaded connector still declaring a 
dependence on things which are now shaded, so breaking convergence.






[GitHub] [spark] leewyang commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


leewyang commented on code in PR #37734:
URL: https://github.com/apache/spark/pull/37734#discussion_r960972099


##
python/pyspark/ml/executor_globals.py:
##
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Module to hold globals for python processes on executors
+from typing import Callable, Optional
+from uuid import UUID
+
+
+model_uuid: Optional[UUID] = None
+predict_fn: Optional[Callable] = None

Review Comment:
   Right, updated the code.  Probably still need to figure out a scheme for 
cache invalidation for large models and/or frequent use.






[GitHub] [spark] zero323 commented on a diff in pull request #37748: [SPARK-40210][PYTHON][CORE] Fix math atan2, hypot, pow and pmod float argument call

2022-09-01 Thread GitBox


zero323 commented on code in PR #37748:
URL: https://github.com/apache/spark/pull/37748#discussion_r960952519


##
python/pyspark/sql/functions.py:
##
@@ -108,13 +108,10 @@ def _invoke_binary_math_function(name: str, col1: Any, col2: Any) -> Column:
 Invokes binary JVM math function identified by name
 and wraps the result with :class:`~pyspark.sql.Column`.
 """
-    return _invoke_function(
-        name,
-        # For legacy reasons, the arguments here can be implicitly converted into floats,
-        # if they are not columns or strings.
-        _to_java_column(col1) if isinstance(col1, (str, Column)) else float(col1),
-        _to_java_column(col2) if isinstance(col2, (str, Column)) else float(col2),
-    )
+
+    # For legacy reasons, the arguments here can be implicitly converted into column
+    cols = [_to_java_column(c if isinstance(c, (str, Column)) else lit(c)) for c in (col1, col2)]

Review Comment:
   A JVM literal column can be created directly with [`_create_column_from_literal`](https://github.com/apache/spark/blob/fdc11ab0494a681444e7a7e13f3f99d25fa6cf2f/python/pyspark/sql/column.py#L47-L50).
   






[GitHub] [spark] amaliujia commented on pull request #37750: [SPARK-40296] Error class for DISTINCT function not found

2022-09-01 Thread GitBox


amaliujia commented on PR #37750:
URL: https://github.com/apache/spark/pull/37750#issuecomment-1234597166

   R @MaxGekk 





[GitHub] [spark] sigmod commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

2022-09-01 Thread GitBox


sigmod commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960939857


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
   Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+this(child, Literal(estimatedNumItems),
+  Literal(math.min(
+BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 30L.toDouble),
+SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
+  }

Review Comment:
   LGTM. Can we make sure that if estimatedNumItems >= RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS, it's always DEFAULT_FPP?
   
   Thanks, @wangyum !
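For context, the classic Bloom filter sizing formula behind `optimalNumOfBits` is m = -n·ln(p) / (ln 2)², which ties the number of bits to the item count n and target false-positive probability p. A sketch follows; the function name mirrors the Java method but this is not Spark's implementation:

```python
import math

def optimal_num_of_bits(n: int, p: float) -> int:
    """Bits needed for a Bloom filter holding n items at false-positive probability p."""
    # m = -n * ln(p) / (ln 2)^2
    return int(-n * math.log(p) / (math.log(2) ** 2))
```

A tighter false-positive target always costs more bits, which is why the computed value is capped by a configured maximum in the diff above.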






[GitHub] [spark] santosh-d3vpl3x opened a new pull request, #37761: Add withColumnsRenamed to scala API of spark

2022-09-01 Thread GitBox


santosh-d3vpl3x opened a new pull request, #37761:
URL: https://github.com/apache/spark/pull/37761

   
   
   ### What changes were proposed in this pull request?
   This change adds an ability for code to rename multiple columns in a single 
call. 
   ```scala
   withColumnsRenamed(colsMap: Map[String, String]): DataFrame
   ```
   
   ### Why are the changes needed?
   We have seen that the Catalyst optimiser struggles with bigger plans. In our setup, the largest contribution to these plans comes from `withColumnRenamed`, `drop` and `withColumn` being called in a for loop by unknowing users. The `master` branch of Spark already has multi-column versions of `withColumns` and `drop`. The missing bit of the puzzle is `withColumnRenamed`.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, adds a method to efficiently rename columns in a single batch.
   
   ### How was this patch tested?
   Added unit tests
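The plan-size argument can be illustrated with a toy model: looping `withColumnRenamed` adds one projection node per rename, while a map-based rename touches the schema once. A pure-Python sketch of the single-pass behavior (illustrative only, not the Spark API):

```python
def with_columns_renamed(columns, rename_map):
    """Rename many columns in a single pass over the schema."""
    # Columns absent from the map pass through unchanged.
    return [rename_map.get(c, c) for c in columns]
```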
   





[GitHub] [spark] MaxGekk commented on a diff in pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


MaxGekk commented on code in PR #37744:
URL: https://github.com/apache/spark/pull/37744#discussion_r960901497


##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {
+"message" : [
+  "the left and right arguments of the binary operator have different 
types ( and )."

Review Comment:
   One of properties of any binary operator is to have the same types of its 
inputs, see
   
https://github.com/apache/spark/blob/fab65f4e393359b9523e34457c877083dba0e7e5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L738-L742
   
   This is what the error says precisely. I am not sure that we should 
introduce the term of incompatibility here. 






[GitHub] [spark] MaxGekk commented on a diff in pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


MaxGekk commented on code in PR #37744:
URL: https://github.com/apache/spark/pull/37744#discussion_r960893478


##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {

Review Comment:
   > the problem that they are incompatible, rather than different?
   
   This PR aims to migrate on error classes but not to change the type checking 
logic. In fact, we check that types of binary operands are the same, see in the 
master
   
https://github.com/apache/spark/blob/fab65f4e393359b9523e34457c877083dba0e7e5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L761-L764






[GitHub] [spark] MaxGekk commented on a diff in pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


MaxGekk commented on code in PR #37744:
URL: https://github.com/apache/spark/pull/37744#discussion_r960894747


##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {
+"message" : [
+  "the left and right arguments of the binary operator have different 
types ( and )."
+]
+  },
+  "BINARY_OP_WRONG_TYPE" : {
+"message" : [
+  "the binary operator requires the input type , not 
."

Review Comment:
   First, the binary op checks that operators have the same type, and then 
check that the left is compatible with expected input type, see
   
https://github.com/apache/spark/blob/fab65f4e393359b9523e34457c877083dba0e7e5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L765-L767
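The two-step check described here — same operand types first, then compatibility with the expected input type — can be sketched in plain Python; the names and return values are illustrative, not Spark's:

```python
def check_binary_op(left_type: str, right_type: str, expected: str) -> str:
    """Mimic the check ordering: DIFF_TYPES is reported before WRONG_TYPE."""
    if left_type != right_type:
        return f"BINARY_OP_DIFF_TYPES: {left_type} vs {right_type}"
    if left_type != expected:
        return f"BINARY_OP_WRONG_TYPE: requires {expected}, not {left_type}"
    return "ok"
```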







[GitHub] [spark] dongjoon-hyun commented on pull request #37755: [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37755:
URL: https://github.com/apache/spark/pull/37755#issuecomment-1234520875

   Merged to master/3.3.





[GitHub] [spark] dongjoon-hyun closed pull request #37755: [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test

2022-09-01 Thread GitBox


dongjoon-hyun closed pull request #37755: [SPARK-40304][K8S][TESTS] Add 
`decomTestTag` to K8s Integration Test
URL: https://github.com/apache/spark/pull/37755





[GitHub] [spark] dongjoon-hyun commented on pull request #37755: [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37755:
URL: https://github.com/apache/spark/pull/37755#issuecomment-1234519012

   Thank you, @viirya . Yes, it was the same `Base Image Build` failure. After 
re-triggering, it succeeds and now linter is running.
   (screenshot: https://user-images.githubusercontent.com/9700541/187966369-e892c9dc-0d9b-4c56-ab90-1afda3c185b8.png)
   





[GitHub] [spark] viirya commented on pull request #37755: [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test

2022-09-01 Thread GitBox


viirya commented on PR #37755:
URL: https://github.com/apache/spark/pull/37755#issuecomment-1234517226

   Some CI tasks were not finished normally. Seems unrelated, though.





[GitHub] [spark] dongjoon-hyun commented on pull request #37753: [SPARK-40302][K8S][TESTS] Add `YuniKornSuite`

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37753:
URL: https://github.com/apache/spark/pull/37753#issuecomment-1234513022

   Merged to master/3.3.





[GitHub] [spark] dongjoon-hyun closed pull request #37753: [SPARK-40302][K8S][TESTS] Add `YuniKornSuite`

2022-09-01 Thread GitBox


dongjoon-hyun closed pull request #37753: [SPARK-40302][K8S][TESTS] Add 
`YuniKornSuite`
URL: https://github.com/apache/spark/pull/37753





[GitHub] [spark] dongjoon-hyun commented on pull request #37753: [SPARK-40302][K8S][TESTS] Add `YuniKornSuite`

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37753:
URL: https://github.com/apache/spark/pull/37753#issuecomment-1234510668

   Thank you, @viirya . Yes, there was an irrelevant `Base Image Build` job failure and corresponding PySpark UT failures. I re-triggered.
   https://user-images.githubusercontent.com/9700541/187964842-03a3ef81-5963-4ea0-ad13-b3260ce38b59.png





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234507890

   Thank you so much, @Yikun . Now, it seems to work on my three PRs.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on code in PR #37745:
URL: https://github.com/apache/spark/pull/37745#discussion_r960841601


##
hadoop-cloud/pom.xml:
##
@@ -135,6 +135,18 @@
 
   
 
+    <dependency>
+      <groupId>com.google.cloud.bigdataoss</groupId>
+      <artifactId>gcs-connector</artifactId>
+      <version>${gcs-connector.version}</version>
+      <classifier>shaded</classifier>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>

Review Comment:
   Thank you for review, @sunchao . According to the shaded pattern,
   
   
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/gcs/pom.xml#L208-L363
   
   We have everything we need for Hadoop3 and Hadoop2.
   - For Hadoop3, 
https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-2.2.7
   - For Hadoop2, 
https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop2-2.2.7
   
   I intentionally excluded everything. We will add Spark's version of any missing transitive dependency, if one turns up.






[GitHub] [spark] sunchao commented on a diff in pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


sunchao commented on code in PR #37745:
URL: https://github.com/apache/spark/pull/37745#discussion_r960782293


##
hadoop-cloud/pom.xml:
##
@@ -135,6 +135,18 @@
 
   
 
+    <dependency>
+      <groupId>com.google.cloud.bigdataoss</groupId>
+      <artifactId>gcs-connector</artifactId>
+      <version>${gcs-connector.version}</version>
+      <classifier>shaded</classifier>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>

Review Comment:
   Curious why we exclude everything from the shaded jar?






[GitHub] [spark] wangyum commented on a diff in pull request #37759: [SPARK-40306][SQL]Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread GitBox


wangyum commented on code in PR #37759:
URL: https://github.com/apache/spark/pull/37759#discussion_r960788909


##
sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala:
##
@@ -76,15 +76,15 @@ private[sql] class ExternalAppendOnlyUnsafeRowArray(
 
   private var spillableArray: UnsafeExternalSorter = _
   private var totalSpillBytes: Long = 0
-  private var numRows = 0
+  private var numRows = 0L

Review Comment:
   `private var numRows` -> `private var numRows: Long =`?






[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234426066

   
https://github.com/users/dongjoon-hyun/packages/container/apache-spark-ci-image/settings
   
   You can also remove it on that page ^





[GitHub] [spark] peter-toth commented on pull request #37760: [SPARK-38404][SQL][3.3] Improve CTE resolution when a nested CTE references an outer CTE

2022-09-01 Thread GitBox


peter-toth commented on PR #37760:
URL: https://github.com/apache/spark/pull/37760#issuecomment-1234423325

   This backport is needed for 
https://github.com/apache/spark/pull/37751#issuecomment-1234336628
   
   cc @cloud-fan





[GitHub] [spark] peter-toth commented on pull request #37751: [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved

2022-09-01 Thread GitBox


peter-toth commented on PR #37751:
URL: https://github.com/apache/spark/pull/37751#issuecomment-1234422501

   @cloud-fan, here it is: https://github.com/apache/spark/pull/37760





[GitHub] [spark] peter-toth opened a new pull request, #37760: [SPARK-38404][SQL][3.3] Improve CTE resolution when a nested CTE references an outer CTE

2022-09-01 Thread GitBox


peter-toth opened a new pull request, #37760:
URL: https://github.com/apache/spark/pull/37760

   ### What changes were proposed in this pull request?
   Please note that the bug in [SPARK-38404](https://issues.apache.org/jira/browse/SPARK-38404) was already fixed by https://github.com/apache/spark/pull/34929.
   This PR is a minor improvement to the current implementation by collecting 
already resolved outer CTEs to avoid re-substituting already collected CTE 
definitions.
   
   ### Why are the changes needed?
   Small improvement + additional tests.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Added new test case.





[GitHub] [spark] wankunde commented on a diff in pull request #37533: [SPARK-40096]Fix finalize shuffle stage slow due to connection creation slow

2022-09-01 Thread GitBox


wankunde commented on code in PR #37533:
URL: https://github.com/apache/spark/pull/37533#discussion_r960777861


##
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##
@@ -2242,60 +2251,57 @@ private[spark] class DAGScheduler(
     val numMergers = stage.shuffleDep.getMergerLocs.length
     val results = (0 until numMergers).map(_ => SettableFuture.create[Boolean]())
     externalShuffleClient.foreach { shuffleClient =>
-      if (!registerMergeResults) {
-        results.foreach(_.set(true))
-        // Finalize in separate thread as shuffle merge is a no-op in this case
-        shuffleMergeFinalizeScheduler.schedule(new Runnable {
-          override def run(): Unit = {
-            stage.shuffleDep.getMergerLocs.foreach {
-              case shuffleServiceLoc =>
-                // Sends async request to shuffle service to finalize shuffle merge on that host.
-                // Since merge statuses will not be registered in this case,
-                // we pass a no-op listener.
-                shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
-                  shuffleServiceLoc.port, shuffleId, shuffleMergeId,
-                  new MergeFinalizerListener {
-                    override def onShuffleMergeSuccess(statuses: MergeStatuses): Unit = {
-                    }
+      val scheduledFutures = stage.shuffleDep.getMergerLocs.zipWithIndex.map {
+        case (shuffleServiceLoc, index) =>
+          // Sends async request to shuffle service to finalize shuffle merge on that host
+          // TODO: SPARK-35536: Cancel finalizeShuffleMerge if the stage is cancelled
+          // TODO: during shuffleMergeFinalizeWaitSec
+          shuffleSendFinalizeRPCExecutor.submit(new Runnable() {
+            override def run(): Unit = {
+              shuffleClient.finalizeShuffleMerge(shuffleServiceLoc.host,
+                shuffleServiceLoc.port, shuffleId, shuffleMergeId,
+                new MergeFinalizerListener {
+                  override def onShuffleMergeSuccess(statuses: MergeStatuses): Unit = {
+                    assert(shuffleId == statuses.shuffleId)
+                    eventProcessLoop.post(RegisterMergeStatuses(stage, MergeStatus.
+                      convertMergeStatusesToMergeStatusArr(statuses, shuffleServiceLoc)))
+                    results(index).set(true)
+                  }

Review Comment:
   Thanks for your explanation,  I have updated the PR.






[GitHub] [spark] wankunde opened a new pull request, #37759: [SPARK-40306][SQL]Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread GitBox


wankunde opened a new pull request, #37759:
URL: https://github.com/apache/spark/pull/37759

   
   
   
   ### What changes were proposed in this pull request?
   
   Support more than Integer.MAX_VALUE records with the same join key.
   
   ### Why are the changes needed?
   
   For SMJ, if the number of records with the same join key in the right table is greater than Integer.MAX_VALUE, the result will be incorrect.
   Before the join, we put the records with the same join key into an ExternalAppendOnlyUnsafeRowArray. An overflow of ExternalAppendOnlyUnsafeRowArray.numRows may cause OOM, and during the join an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Manual test.
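   To make the failure mode concrete, here is a small, hypothetical Python sketch (not Spark code) of why a 32-bit `numRows`/`startIndex` breaks past `Integer.MAX_VALUE`: Java `int` arithmetic wraps around, so the counter suddenly becomes negative and is useless as a size or index.

```python
import ctypes

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE


def add_as_int32(counter: int, delta: int) -> int:
    """Simulate Java 32-bit `int` addition: the result wraps on overflow."""
    return ctypes.c_int32(counter + delta).value


# A 32-bit row counter that has just reached Integer.MAX_VALUE wraps to a
# large negative number when the next row with the same join key is appended.
num_rows = INT_MAX
assert add_as_int32(num_rows, 1) == -2**31

# Widening the counter to 64 bits (Scala `Long`, as in this PR) keeps it valid.
num_rows_long = ctypes.c_int64(INT_MAX + 1).value
assert num_rows_long == 2**31
```

   This mirrors the one-character fix in the diff (`numRows = 0` becoming `numRows = 0L`); the real Spark internals are more involved, since the array also spills to disk.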





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234388728

   1. I checked that mine is the same as yours.
   https://user-images.githubusercontent.com/9700541/187943970-bd5d40bf-8545-4d50-b7eb-16fc4a0440d8.png
   
   2. Let me try to clean up
   ```
   curl -X DELETE -H "Accept: application/vnd.github+json" -H "Authorization: 
token $REPLACE_ME" 
https://api.github.com/user/packages/container/apache-spark-ci-image
   ```





[GitHub] [spark] LuciferYang commented on a diff in pull request #37700: [SPARK-40251][BUILD][MLLIB] Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 & breeze from 2.0 to 2.1.0

2022-09-01 Thread GitBox


LuciferYang commented on code in PR #37700:
URL: https://github.com/apache/spark/pull/37700#discussion_r960736796


##
mllib-local/pom.xml:
##
@@ -61,6 +61,11 @@
   org.apache.spark
   spark-tags_${scala.binary.version}
 
+

Review Comment:
   > We can't do it for that. Just inline what you need. There may be a 'copy' 
in an MLlib utils class even, not sure
   
   +1, Agree






[GitHub] [spark] wangyum commented on a diff in pull request #37697: [SPARK-40248][SQL] Use larger number of bits to build Bloom filter

2022-09-01 Thread GitBox


wangyum commented on code in PR #37697:
URL: https://github.com/apache/spark/pull/37697#discussion_r960736515


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala:
##
@@ -55,6 +55,13 @@ case class BloomFilterAggregate(
   Multiply(estimatedNumItemsExpression, Literal(8L)))
   }
 
+  def this(child: Expression, estimatedNumItems: Long) = {
+    this(child, Literal(estimatedNumItems),
+      Literal(math.min(
+        BloomFilter.optimalNumOfBits(estimatedNumItems, estimatedNumItems / 30L.toDouble),
+        SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
+  }

Review Comment:
   How about?
   ```scala
   this(child, Literal(estimatedNumItems),
     Literal(math.min(
       BloomFilter.optimalNumOfBits(estimatedNumItems,
         estimatedNumItems / (SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS) / BloomFilter.DEFAULT_FPP)),
       SQLConf.get.getConf(SQLConf.RUNTIME_BLOOM_FILTER_MAX_NUM_BITS))))
   ```
   The smaller `estimatedNumItems`, the smaller the `FPP`.
   
   
   estimatedNumItems | FPP | numBits
   -- | -- | --
   RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS | DEFAULT_FPP | 29193763
   2000000 | 0.015 | 17482271
   1000000 | 0.0075 | 10183830
   100000 | 7.50E-04 | 1497636
   10000 | 7.50E-05 | 197688
   1000 | 7.50E-06 | 24561
   100 | 7.50E-07 | 2935
   10 | 7.50E-08 | 341
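   For reference, the numbers in the table follow the standard Bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2, which is what `BloomFilter.optimalNumOfBits` implements. The Python sketch below reimplements that formula rather than calling the Java class, and assumes a max-items default of 4000000 and a default FPP of 0.03 — values inferred from the table, so treat them as assumptions:

```python
import math

# Assumed defaults, inferred from the table above (not read from Spark).
MAX_NUM_ITEMS = 4_000_000   # stand-in for RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS
DEFAULT_FPP = 0.03          # stand-in for BloomFilter.DEFAULT_FPP


def optimal_num_of_bits(n: int, p: float) -> int:
    """Standard Bloom filter sizing: bits = -n * ln(p) / (ln 2)^2."""
    return int(-n * math.log(p) / (math.log(2) ** 2))


def scaled_fpp(estimated_num_items: int) -> float:
    """The suggested scaling: smaller item estimates get a smaller FPP."""
    return estimated_num_items / (MAX_NUM_ITEMS / DEFAULT_FPP)


# Reproduce a few rows of the table (off-by-one is possible from rounding).
for n in (2_000_000, 1_000_000, 100_000, 10_000):
    print(n, scaled_fpp(n), optimal_num_of_bits(n, scaled_fpp(n)))
```

   Because the FPP shrinks along with the estimate, the bit count drops quickly as `estimatedNumItems` goes down, which matches the table's last column.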
   
   
   






[GitHub] [spark] srowen commented on a diff in pull request #37700: [SPARK-40251][BUILD][MLLIB] Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 & breeze from 2.0 to 2.1.0

2022-09-01 Thread GitBox


srowen commented on code in PR #37700:
URL: https://github.com/apache/spark/pull/37700#discussion_r960734725


##
mllib-local/pom.xml:
##
@@ -61,6 +61,11 @@
   org.apache.spark
   spark-tags_${scala.binary.version}
 
+

Review Comment:
   We can't do it for that. Just inline what you need. There may be a 'copy' in 
an MLlib utils class even, not sure 






[GitHub] [spark] peter-toth commented on pull request #37751: [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved

2022-09-01 Thread GitBox


peter-toth commented on PR #37751:
URL: https://github.com/apache/spark/pull/37751#issuecomment-1234357072

   > It has conflicts in 3.3, due to missing #36146 . @peter-toth can you help 
to open a backport PR for it? Thanks!
   
   Sure, I can open it today.





[GitHub] [spark] srielau commented on a diff in pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


srielau commented on code in PR #37744:
URL: https://github.com/apache/spark/pull/37744#discussion_r960719687


##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {
+"message" : [
+  "the left and right arguments of the binary operator have different 
types ( and )."
+]
+  },
+  "BINARY_OP_WRONG_TYPE" : {
+"message" : [
+  "the binary operator requires the input type , not 
."

Review Comment:
   Are you sure this is about the LEFT type?
   Aren't those non-symmetrical operators deriving the right type from the left type?




##
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala:
##
@@ -47,14 +47,23 @@ class ExpressionTypeCheckingSuite extends SparkFunSuite 
with SQLHelper {
 assert(e.getMessage.contains(errorMessage))
   }
 
+  def assertTypeMismatch(expr: Expression, errorMessage: String): Unit = {
+val e = intercept[AnalysisException] {
+  assertSuccess(expr)
+}
+assert(e.getMessage.contains(
+  s"""Cannot resolve "${expr.sql}" due to data type mismatch:"""))
+assert(e.getMessage.contains(errorMessage))

Review Comment:
   Use checkError(), don't test the text



##
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala:
##
@@ -2666,10 +2666,7 @@ class SQLQuerySuite extends QueryTest with 
SharedSparkSession with AdaptiveSpark
 val m = intercept[AnalysisException] {
   sql("SELECT * FROM t, S WHERE c = C")
 }.message
-assert(
-  m.contains(
-"cannot resolve '(spark_catalog.default.t.c = " +
-"spark_catalog.default.S.C)' due to data type mismatch"))
+assert(m.contains("""Cannot resolve "(c = C)" due to data type 
mismatch"""))

Review Comment:
   checkError()



##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {
+"message" : [
+  "the left and right arguments of the binary operator have different 
types ( and )."

Review Comment:
   ```suggestion
 "the left and right operands of the binary operator have 
incompatible types ( and )."
   ```



##
core/src/main/resources/error/error-classes.json:
##
@@ -75,6 +75,23 @@
   "The value  () cannot be converted to  because it 
is malformed. Correct the value as per the syntax, or change its format. Use 
 to tolerate malformed input and return NULL instead."
 ]
   },
+  "DATATYPE_MISMATCH" : {
+"message" : [
+  "Cannot resolve  due to data type mismatch:"
+],
+"subClass" : {
+  "BINARY_OP_DIFF_TYPES" : {

Review Comment:
   Isn't the problem that they are incompatible, rather than different? We will happily divide an integer by a float. We even divide an interval by an integer.
   






[GitHub] [spark] cloud-fan commented on pull request #37751: [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved

2022-09-01 Thread GitBox


cloud-fan commented on PR #37751:
URL: https://github.com/apache/spark/pull/37751#issuecomment-1234336628

   It has conflicts in 3.3, due to missing 
https://github.com/apache/spark/pull/36146 . @peter-toth can you help to open a 
backport PR for it? Thanks!





[GitHub] [spark] cloud-fan commented on pull request #37751: [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved

2022-09-01 Thread GitBox


cloud-fan commented on PR #37751:
URL: https://github.com/apache/spark/pull/37751#issuecomment-1234333082

   thanks, merging to master!





[GitHub] [spark] cloud-fan closed pull request #37751: [SPARK-40297][SQL] CTE outer reference nested in CTE main body cannot be resolved

2022-09-01 Thread GitBox


cloud-fan closed pull request #37751: [SPARK-40297][SQL] CTE outer reference 
nested in CTE main body cannot be resolved
URL: https://github.com/apache/spark/pull/37751





[GitHub] [spark] panbingkun commented on a diff in pull request #37700: [SPARK-40251][BUILD][MLLIB] Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 & breeze from 2.0 to 2.1.0

2022-09-01 Thread GitBox


panbingkun commented on code in PR #37700:
URL: https://github.com/apache/spark/pull/37700#discussion_r960672738


##
mllib-local/pom.xml:
##
@@ -61,6 +61,11 @@
   org.apache.spark
   spark-tags_${scala.binary.version}
 
+

Review Comment:
   for Utils.classForName






[GitHub] [spark] cloud-fan commented on pull request #37750: [SPARK-40296] Error class for DISTINCT function not found

2022-09-01 Thread GitBox


cloud-fan commented on PR #37750:
URL: https://github.com/apache/spark/pull/37750#issuecomment-1234297015

   cc @MaxGekk 





[GitHub] [spark] srowen closed pull request #37731: [SPARK-40279][DOC] Document spark.yarn.report.interval

2022-09-01 Thread GitBox


srowen closed pull request #37731: [SPARK-40279][DOC] Document 
spark.yarn.report.interval
URL: https://github.com/apache/spark/pull/37731





[GitHub] [spark] srowen commented on pull request #37731: [SPARK-40279][DOC] Document spark.yarn.report.interval

2022-09-01 Thread GitBox


srowen commented on PR #37731:
URL: https://github.com/apache/spark/pull/37731#issuecomment-1234285256

   Merged to master





[GitHub] [spark] srowen commented on a diff in pull request #37700: [SPARK-40251][BUILD][MLLIB] Upgrade dev.ludovic.netlib from 2.2.1 to 3.0.2 & breeze from 2.0 to 2.1.0

2022-09-01 Thread GitBox


srowen commented on code in PR #37700:
URL: https://github.com/apache/spark/pull/37700#discussion_r960657137


##
mllib-local/pom.xml:
##
@@ -61,6 +61,11 @@
   org.apache.spark
   spark-tags_${scala.binary.version}
 
+

Review Comment:
   Oh, I don't think we want to introduce this circular dependency. What's it 
for?






[GitHub] [spark] Ngone51 commented on a diff in pull request #37411: [SPARK-39984][CORE] Check workerLastHeartbeat with master before HeartbeatReceiver expires an executor

2022-09-01 Thread GitBox


Ngone51 commented on code in PR #37411:
URL: https://github.com/apache/spark/pull/37411#discussion_r960646158


##
core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:
##
@@ -77,17 +77,61 @@ private[spark] class HeartbeatReceiver(sc: SparkContext, 
clock: Clock)
 
   private[spark] var scheduler: TaskScheduler = null
 
-  // executor ID -> timestamp of when the last heartbeat from this executor 
was received
+  /**
+   * [SPARK-39984]
+   * Please make sure the intersection between `executorLastSeen` and 
`executorExpiryCandidates` is
+   * an empty set. If the intersection is not empty, it is possible to never 
kill the executor until
+   * the executor recovers. When an executor is in both `executorLastSeen` and
+   * `executorExpiryCandidates`, the value of `workerLastHeartbeat` in 
`executorExpiryCandidates`
+   * may update if the worker sends heartbeats to master normally.
+   *
+   * `executorLastSeen`:
+   *  - key: executor ID
+   *  - value: timestamp of when the last heartbeat from this executor was 
received
+   *
+   * `executorExpiryCandidates`: executor ID -> WorkerLastHeartbeat
+   *  - key: executor ID
+   *  - value: timestamp of when the last heartbeat from the worker was 
received
+   *
+   * when driver does not receive any heartbeat from an executor for 
`executorTimeoutMs` seconds,
+   * the driver will ask master for the last heartbeat from the worker which 
the executor is running
+   * on.
+   */
   private val executorLastSeen = new HashMap[String, Long]
+  private val executorExpiryCandidates = new HashMap[String, Long]
 
   private val executorTimeoutMs = sc.conf.get(
     config.STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT
-  ).getOrElse(Utils.timeStringAsMs(s"${sc.conf.get(Network.NETWORK_TIMEOUT)}s"))
+  ).getOrElse(
+    sc.conf.get(Network.NETWORK_EXECUTOR_TIMEOUT) match {
+      case Some(executorTimeout) => executorTimeout
+      case None => Utils.timeStringAsMs(s"${sc.conf.get(Network.NETWORK_TIMEOUT)}s")
+    }
+  )
 
   private val checkTimeoutIntervalMs = sc.conf.get(Network.NETWORK_TIMEOUT_INTERVAL)
 
   private val executorHeartbeatIntervalMs = sc.conf.get(config.EXECUTOR_HEARTBEAT_INTERVAL)
 
+  /**
+   * Currently, [SPARK-39984] is only for StandaloneSchedulerBackend.
+   *
+   * `checkWorkerLastHeartbeat`: A flag to enable two-phase executor timeout.
+   * `expiryCandidatesTimeout`: The timeout used for executorExpiryCandidates.
+   */
+  private val checkWorkerLastHeartbeat = {

Review Comment:
   > However, if the first expireDeadHosts trigger is prior to scheduler 
backend initialization, the value of checkWorkerLastHeartbeat will be false
   
   I thought there would be an initial delay before the first `expireDeadHosts` so 
`checkWorkerLastHeartbeat` would be less likely to return false. But it seems there 
is no initial delay, which I don't think makes sense.
   
   I actually think we can set the initial delay to its check interval: 
`checkTimeoutIntervalMs`.
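The two-phase timeout described in the doc comment above can be sketched as follows. This is an illustrative Python model of the idea, not Spark's Scala implementation; all names (`TwoPhaseExpiry`, `expire_dead_hosts`, etc.) are hypothetical.

```python
class TwoPhaseExpiry:
    """Sketch of the two-phase executor timeout described above.

    An executor ID lives in exactly one of the two maps at a time, so
    their key sets never intersect (the invariant [SPARK-39984] requires).
    """

    def __init__(self, executor_timeout, expiry_candidates_timeout):
        self.executor_timeout = executor_timeout
        self.expiry_candidates_timeout = expiry_candidates_timeout
        self.executor_last_seen = {}   # executor ID -> executor heartbeat ts
        self.expiry_candidates = {}    # executor ID -> worker heartbeat ts

    def executor_heartbeat(self, executor_id, now):
        # A live executor heartbeat moves the ID back to executor_last_seen,
        # preserving the empty-intersection invariant.
        self.expiry_candidates.pop(executor_id, None)
        self.executor_last_seen[executor_id] = now

    def expire_dead_hosts(self, now, worker_last_heartbeat):
        """worker_last_heartbeat: executor ID -> last worker heartbeat ts."""
        killed = []
        for eid, ts in list(self.executor_last_seen.items()):
            if now - ts > self.executor_timeout:
                # Phase 1: no executor heartbeat; instead of killing
                # immediately, track the worker's last heartbeat.
                del self.executor_last_seen[eid]
                self.expiry_candidates[eid] = worker_last_heartbeat[eid]
        for eid, ts in list(self.expiry_candidates.items()):
            if now - ts > self.expiry_candidates_timeout:
                # Phase 2: the worker is silent too; now kill the executor.
                del self.expiry_candidates[eid]
                killed.append(eid)
        return killed
```

In this model, a healthy worker keeps refreshing the candidate timestamp, so a network hiccup between driver and executor alone does not kill the executor.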
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #37739: [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection.

2022-09-01 Thread GitBox


zhengruifeng commented on PR #37739:
URL: https://github.com/apache/spark/pull/37739#issuecomment-1234230281

   what if the `psidx` itself is a `MultiIndex`?
   
   ```
   >>> psidx
   Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
   >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
   MultiIndex([], )
   ```





[GitHub] [spark] cloud-fan commented on pull request #37758: [SPARK-40149][SQL] Propagate metadata columns through Project

2022-09-01 Thread GitBox


cloud-fan commented on PR #37758:
URL: https://github.com/apache/spark/pull/37758#issuecomment-1234209788

   cc @karenfeng @viirya @huaxingao 





[GitHub] [spark] cloud-fan opened a new pull request, #37758: [SPARK-40149][SQL] Propagate metadata columns through Project

2022-09-01 Thread GitBox


cloud-fan opened a new pull request, #37758:
URL: https://github.com/apache/spark/pull/37758

   
   
   ### What changes were proposed in this pull request?
   
   This PR fixes a regression caused by 
https://github.com/apache/spark/pull/32017 .
   
   In https://github.com/apache/spark/pull/32017 , we tried to be more 
conservative and decided to not propagate metadata columns in certain 
operators, including `Project`. However, the decision was made only considering 
SQL API, not DataFrame API. In fact, it's very common to chain `Project` 
operators in DataFrame, e.g. `df.withColumn(...).withColumn(...)...`, and it's 
very inconvenient if metadata columns are not propagated through `Project`.
   
   This PR makes 2 changes:
   1. Project should propagate metadata columns
   2. SubqueryAlias should only propagate metadata columns if the child is a 
leaf node or also a SubqueryAlias
   
   The second change is needed to still forbid weird queries like `SELECT m 
from (SELECT a from t)`, which is the main motivation of 
https://github.com/apache/spark/pull/32017 .
   
   ### Why are the changes needed?
   
   fix a regression
   
   ### Does this PR introduce _any_ user-facing change?
   
   For SQL API, there is no change, as a `SubqueryAlias` always comes with a 
`Project` or `Aggregate`, so we still don't propagate metadata columns through 
a SELECT group.
   
   For DataFrame API, the behavior becomes more lenient. The only breaking case 
is an operator that can propagate metadata columns followed by a 
`SubqueryAlias`, e.g. `df.filter(...).as("t").select("t.metadata_col")`. But 
this is a weird use case and I don't think we should support it in the first 
place.
   
   ### How was this patch tested?
   
   new tests





[GitHub] [spark] gitlabsam opened a new pull request, #37757: Branch 3.3 sam

2022-09-01 Thread GitBox


gitlabsam opened a new pull request, #37757:
URL: https://github.com/apache/spark/pull/37757

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] zhengruifeng opened a new pull request, #37756: [SPARK-40305][PS] Implement Groupby.sem

2022-09-01 Thread GitBox


zhengruifeng opened a new pull request, #37756:
URL: https://github.com/apache/spark/pull/37756

   ### What changes were proposed in this pull request?
   Implement Groupby.sem
   
   
   ### Why are the changes needed?
   to increase API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new API
   
   
   ### How was this patch tested?
   added UT





[GitHub] [spark] WeichenXu123 commented on pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


WeichenXu123 commented on PR #37734:
URL: https://github.com/apache/spark/pull/37734#issuecomment-1234139636

   But I think we'd better design and discuss the API first. @mengxr Do you 
have any suggestions ?





[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


WeichenXu123 commented on code in PR #37734:
URL: https://github.com/apache/spark/pull/37734#discussion_r960529530


##
python/pyspark/ml/executor_globals.py:
##
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Module to hold globals for python processes on executors
+from typing import Callable, Optional
+from uuid import UUID
+
+
+model_uuid: Optional[UUID] = None
+predict_fn: Optional[Callable] = None

Review Comment:
   This is not the correct way to build a cache: it is not thread-safe. Many 
concurrent Spark tasks (possibly from different Spark jobs) may be spawned on 
the executor side.
   
   To cache model, you can copy mlflow code of:
   
https://github.com/mlflow/mlflow/blob/master/mlflow/pyfunc/spark_model_cache.py
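A minimal lock-protected cache along the lines of the suggestion above (mlflow's `SparkModelCache` takes the same lock-around-a-dict approach) might look like this. The class and method names are illustrative only, not the actual pyspark.ml or mlflow API.

```python
import threading
import uuid


class ModelCache:
    """Sketch of a thread-safe, per-UUID model cache for executor processes.

    Multiple concurrent task threads may call get_or_load at once, so all
    reads and writes of the shared dict happen under a single lock.
    """

    _lock = threading.Lock()
    _cache = {}  # model uuid -> loaded predict function

    @classmethod
    def get_or_load(cls, model_uuid, load_fn):
        with cls._lock:
            if model_uuid not in cls._cache:
                # A new driver-side uuid invalidates any stale entry,
                # so the old model is never reused.
                cls._cache.clear()
                cls._cache[model_uuid] = load_fn()
            return cls._cache[model_uuid]
```

Holding the lock across `load_fn()` serializes the first load, which is usually acceptable since loading happens once per task batch rather than per row.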
   






[GitHub] [spark] zero323 commented on a diff in pull request #37748: [SPARK-40210][PYTHON][CORE] Fix math atan2, hypot, pow and pmod float argument call

2022-09-01 Thread GitBox


zero323 commented on code in PR #37748:
URL: https://github.com/apache/spark/pull/37748#discussion_r960525054


##
python/pyspark/sql/functions.py:
##
@@ -108,12 +108,13 @@ def _invoke_binary_math_function(name: str, col1: Any, 
col2: Any) -> Column:
 Invokes binary JVM math function identified by name
 and wraps the result with :class:`~pyspark.sql.Column`.
 """
+
+def align_type(c: Any) -> Union[str, Column]:
+# For legacy reasons, the arguments here can be implicitly converted 
into column
+return c if isinstance(c, (str, Column)) else lit(c)

Review Comment:
   Unless we plan to reuse this in the future, it'd make more sense to just 
inline the expression.
   
   
   If we do plan to reuse it, this should be marked as internal (`align_type` 
-> `_align_type`) and moved next to other helper functions (also, a docstring 
would be a good idea).






[GitHub] [spark] WeichenXu123 commented on a diff in pull request #37734: [SPARK-40264][ML] add batch_infer_udf function to pyspark.ml.functions

2022-09-01 Thread GitBox


WeichenXu123 commented on code in PR #37734:
URL: https://github.com/apache/spark/pull/37734#discussion_r960524940


##
python/pyspark/ml/functions.py:
##
@@ -106,6 +112,170 @@ def array_to_vector(col: Column) -> Column:
 return 
Column(sc._jvm.org.apache.spark.ml.functions.array_to_vector(_to_java_column(col)))
 
 
+def batched(df: pd.DataFrame, batch_size: int = -1) -> Iterator[pd.DataFrame]:
+"""Generator that splits a pandas dataframe/series into batches."""
+if batch_size <= 0 or batch_size >= len(df):
+yield df
+else:
+# for batch in np.array_split(df, (len(df.index) + batch_size - 1) // 
batch_size):
+for _, batch in df.groupby(np.arange(len(df)) // batch_size):
+yield batch
+
+
+def has_tensor_cols(df: pd.DataFrame) -> bool:
+"""Check if input DataFrame contains any tensor-valued columns"""
+if any(df.dtypes == np.object_):
+# pd.DataFrame object types can contain different types, e.g. string, 
dates, etc.
+# so inspect a row and check for array/list type
+sample = df.iloc[0]
+return any([isinstance(x, np.ndarray) or isinstance(x, list) for x in 
sample])
+else:
+return False
+
+
+def batch_infer_udf(
+predict_batch_fn: Callable,
+return_type: DataType = ArrayType(FloatType()),
+batch_size: int = -1,
+input_names: list[str] = [],
+input_tensor_shapes: list[list[int]] = [],
+**kwargs: Any,
+) -> Callable:
+"""Given a function which loads a model, returns a pandas_udf for 
inferencing over that model.
+
+This will handle:
+- conversion of the Spark DataFrame to numpy arrays.
+- batching of the inputs sent to the model predict() function.
+- caching of the model and prediction function on the executors.
+
+This assumes that the `predict_batch_fn` encapsulates all of the necessary 
dependencies for
+running the model or the Spark executor environment already satisfies all 
runtime requirements.
+
+When selecting columns in pyspark SQL, users are required to always use 
`struct` for simplicity.
+
+For the conversion of Spark DataFrame to numpy, the following table 
describes the behavior,
+where tensor columns in the Spark DataFrame must be represented as a 
flattened 1-D array/list.
+
+| dataframe \\ model | single input | multiple inputs |
+| :- | :--- | :-- |
+| single-col scalar  | 1| N/A |
+| single-col tensor  | 1,2  | N/A |
+| multi-col scalar   | 3| 4   |
+| multi-col tensor   | N/A  | 4,2 |
+
+
+Notes:
+1. pass thru dataframe column => model input as single numpy array.
+2. reshape flattened tensors into expected tensor shapes.
+3. convert entire dataframe into single numpy array via df.to_numpy(), or 
user can use
+   `pyspark.sql.functions.array()` to transform the input into a 
single-col tensor first.
+4. pass thru dataframe column => model input as an (ordered) dictionary of 
numpy arrays.
+
+Parameters
+--
+predict_batch_fn : Callable
+Function which is responsible for loading a model and returning a 
`predict` function.
+return_type : DataType
+Spark SQL datatype for the expected output.
+Default: ArrayType(FloatType())
+batch_size : int
+Batch size to use for inference, note that this is typically a 
limitation of the model
+and/or the hardware resources and is usually smaller than the Spark 
partition size.
+Default: -1, which sends the entire Spark partition to the model.
+input_names: list[str]
+Optional list of input names which will be used to map DataFrame 
column names to model
+input names.  The order of names must match the order of the selected 
DataFrame columns.
+If provided, the `predict()` function will be passed a dictionary of 
named inputs.
+input_tensor_shapes: list[list[int]]
+Optional list of input tensor shapes for models with tensor inputs.  
Each tensor
+input must be represented as a single DataFrame column containing a 
flattened 1-D array.
+The order of the tensor shapes must match the order of the selected 
DataFrame columns.
+Tabular datasets with scalar-valued columns should not supply this 
argument.
+
+Returns
+---
+A pandas_udf for predicting a batch.
+"""
+# generate a new uuid each time this is invoked on the driver to 
invalidate executor-side cache.
+model_uuid = uuid.uuid4()
+
+def predict(data: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
+import pyspark.ml.executor_globals as exec_global
+
+if exec_global.predict_fn and exec_global.model_uuid == model_uuid:
+predict_fn = exec_global.predict_fn
+else:
+predict_fn = predict_batch_fn(**kwargs)
+
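The `batched` helper in the diff above can be modeled without pandas. This pure-Python sketch (illustrative, not the proposed API) shows the same behavior: a non-positive `batch_size` yields the whole partition at once.

```python
def batched(rows, batch_size=-1):
    """Split a sequence into consecutive batches.

    A non-positive batch_size (the default) yields everything in one
    batch, mirroring the whole-partition default in the PR.
    """
    if batch_size <= 0 or batch_size >= len(rows):
        yield rows
    else:
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]
```

For example, `list(batched([1, 2, 3, 4, 5], 2))` produces `[[1, 2], [3, 4], [5]]`.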

[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234097485

   The potential issue might be that you removed the old repo but the image was 
not deleted; when the new repo was created, the write permission for this image 
was not configured for the new repo.





[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234081059

   Could you check this link?
   
   
https://github.com/users/dongjoon-hyun/packages/container/package/apache-spark-ci-image/settings
   
   
![image](https://user-images.githubusercontent.com/1736354/187893383-37752514-c9be-4f3b-bb53-a3f8cdc3e25c.png)
   





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234077478

   It's weird. IIRC, I didn't change anything from my previous repo either when 
your PR applied this change.





[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234071272

   I'm already allowing all of them.
   https://user-images.githubusercontent.com/9700541/187891526-5938feb5-d380-4574-a81a-9b621779dead.png
   





[GitHub] [spark] dongjoon-hyun commented on pull request #37755: [SPARK-40304][K8S][TESTS] Add `decomTestTag` to K8s Integration Test

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37755:
URL: https://github.com/apache/spark/pull/37755#issuecomment-1234068985

   Could you review this please, @viirya ?





[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234067056

   https://github.com/dongjoon-hyun/spark/settings/actions
   
   
![image](https://user-images.githubusercontent.com/1736354/187890839-2f26ce10-2e20-4d7e-ab6e-311c898fc416.png)
   





[GitHub] [spark] dongjoon-hyun opened a new pull request, #37755: [SPARK-40304][K8S][TESTS] Add decomTestTag to K8s Integration Test

2022-09-01 Thread GitBox


dongjoon-hyun opened a new pull request, #37755:
URL: https://github.com/apache/spark/pull/37755

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   





[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234063012

   @dongjoon-hyun I just saw you recreated the spark repo, so maybe the default 
permissions changed on GitHub Actions?
   
   You could first set the permissions for your dongjoon-hyun/spark repo: 
https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/#setting-the-default-permissions-for-the-organization-or-repository
   
   and we might need a separate PR to set the Spark permissions for the newly 
created repo: 
https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/#setting-permissions-in-the-workflow
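At the workflow level, that setting would look roughly like the fragment below. This is a hypothetical snippet based on the changelog links above, not Spark's actual workflow file; the exact scopes needed depend on the job.

```yaml
# Hypothetical GitHub Actions fragment: grant the GITHUB_TOKEN
# write access to GHCR packages for this workflow.
permissions:
  contents: read
  packages: write
```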





[GitHub] [spark] dongjoon-hyun commented on pull request #37753: [SPARK-40302][K8S][TESTS] Add `YuniKornSuite`

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37753:
URL: https://github.com/apache/spark/pull/37753#issuecomment-1234062168

   Could you review this, @viirya ?





[GitHub] [spark] MaxGekk commented on a diff in pull request #37742: [SPARK-40291][SQL] Improve the message for column not in group by clause error

2022-09-01 Thread GitBox


MaxGekk commented on code in PR #37742:
URL: https://github.com/apache/spark/pull/37742#discussion_r960464975


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:
##
@@ -2527,4 +2527,11 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase {
   errorClass = "INVALID_COLUMN_OR_FIELD_DATA_TYPE",
   messageParameters = Array(toSQLId(name), toSQLType(dt), 
toSQLType(expected)))
   }
+
+  def columnNotInGroupByClauseError(expression: Expression): Throwable = {
+new AnalysisException(
+  errorClass = "COLUMN_NOT_IN_GROUP_BY_CLAUSE",
+  messageParameters = Array(expression.sql)

Review Comment:
   expression.sql -> toSQLExpr(expression)



##
core/src/main/resources/error/error-classes.json:
##
@@ -65,6 +65,12 @@
 ],
 "sqlState" : "22005"
   },
+  "COLUMN_NOT_IN_GROUP_BY_CLAUSE" : {
+"message" : [
+  "expression '' is neither present in the group by, nor is it 
an aggregate function. Add to group by or wrap in first() (or first_value) if 
you don't care which value you get."

Review Comment:
   Please, remove '' around  and use toSQLExpr(), and first() 
should be quoted by back ticks. 






[GitHub] [spark] dongjoon-hyun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234055469

   Thank you, but the `Base Image Build` phase failed three times already.
   https://user-images.githubusercontent.com/9700541/187888992-48c0292b-2586-421b-8f9e-9b514ab35cb2.png
   





[GitHub] [spark] MaxGekk commented on pull request #37744: [SPARK-40300][SQL] Migrate onto the `DATATYPE_MISMATCH` error class

2022-09-01 Thread GitBox


MaxGekk commented on PR #37744:
URL: https://github.com/apache/spark/pull/37744#issuecomment-1234048716

   The test failure is not related to this PR, I believe:
   ```
   YarnClusterSuite.run Spark in yarn-client mode with different 
configurations, ensuring redaction
   ```
   @cloud-fan @HyukjinKwon @gengliangwang Could you review the PR, please. It 
migrates a bunch of AnalysisException in golden files to error classes. 





[GitHub] [spark] Yikun commented on pull request #37745: [SPARK-33605][BUILD] Add `gcs-connector` to `hadoop-cloud` module

2022-09-01 Thread GitBox


Yikun commented on PR #37745:
URL: https://github.com/apache/spark/pull/37745#issuecomment-1234041557

   @dongjoon-hyun Thanks for the ping; this is due to GitHub Actions GHCR being 
unstable, so you could retry to make it work.





[GitHub] [spark] dongjoon-hyun commented on pull request #37622: [SPARK-40187][DOCS] Add `Apache YuniKorn` scheduler docs

2022-09-01 Thread GitBox


dongjoon-hyun commented on PR #37622:
URL: https://github.com/apache/spark/pull/37622#issuecomment-1234034066

   I created a test suite PR, @yangwwei .
   - https://github.com/apache/spark/pull/37753





[GitHub] [spark] panbingkun opened a new pull request, #37754: [SPARK-39906][INFRA][FOLLOWUP] Eliminate build warnings - sbt 0.13 shell syntax is deprecated; use slash syntax instead

2022-09-01 Thread GitBox


panbingkun opened a new pull request, #37754:
URL: https://github.com/apache/spark/pull/37754

   ### What changes were proposed in this pull request?
   The PR follows https://github.com/apache/spark/pull/37326.
   When I ran BLASBenchmark on GitHub, the following warnings were displayed in 
the GitHub Actions log:
   https://user-images.githubusercontent.com/15246973/187882232-23681d67-930b-4801-bf0b-10185b6d6ae5.png
   
   ### Why are the changes needed?
   Use the recommended sbt syntax to eliminate the build warning: `sbt 0.13 
shell syntax is deprecated; use slash syntax instead`.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA.




