This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new d0fd730839d8 [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
d0fd730839d8 is described below

commit d0fd730839d8c4351781efb6aee5ff8f7c342ecf
Author: Mark Jarvin <mark.jar...@databricks.com>
AuthorDate: Fri Apr 12 09:37:19 2024 +0900

    [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof

    ### What changes were proposed in this pull request?

    Use the monotonically increasing ID as a sorting condition for `max_by` instead of a literal string.

    ### Why are the changes needed?

    https://github.com/apache/spark/pull/35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

    ### Does this PR introduce _any_ user-facing change?

    Fixes nondeterminism in `asof`.

    ### How was this patch tested?

    In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce the issue.

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46018 from markj-db/SPARK-47824.

    Authored-by: Mark Jarvin <mark.jar...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit a0ccdf27e5ff30817b8f058f08f98d5b44bad2db)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/series.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 5d6c25eca69e..4e2e3ffbb548 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -5878,7 +5878,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
                     # then return monotonically_increasing_id. This will let max by
                     # to return last index value, which is the behaviour of pandas
                     else spark_column.isNotNull(),
-                    monotonically_increasing_id_column,
+                    F.col(monotonically_increasing_id_column),
                 ),
             )
             for index in where
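
The effect of the one-line change can be illustrated with a plain-Python analogy. This is only a sketch, not Spark code: the `max_by` helper, the `rows` data, and the `buggy`/`fixed` functions below are hypothetical stand-ins, and the `when` condition from the real expression is left out. The assumption being modelled is that a bare Python string passed as the value of `F.when` is treated as a literal, so every row receives the same ordering key and the row that `max_by` picks depends on the order in which rows arrive.

```python
from itertools import permutations


def max_by(rows, value_of, order_of):
    """Return value_of(row) for the row with the largest order_of(row).

    On ties the first tied row seen wins, which stands in for an aggregate
    whose result depends on the order rows happen to arrive in.
    """
    best = None
    for row in rows:
        if best is None or order_of(row) > order_of(best):
            best = row
    return None if best is None else value_of(best)


# Hypothetical rows of (row_id, value); row_id plays the role of the
# monotonically increasing id column.
rows = [(0, "a"), (1, "b"), (2, "c")]


def buggy(rows):
    # Every row gets the same constant ordering key, like a string literal.
    return max_by(rows, lambda r: r[1], lambda r: "__monotonically_increasing_id__")


def fixed(rows):
    # The ordering key is the row's actual id, like F.col(...) in the patch.
    return max_by(rows, lambda r: r[1], lambda r: r[0])


# Try every possible arrival order of the same three rows.
buggy_results = {buggy(list(p)) for p in permutations(rows)}
fixed_results = {fixed(list(p)) for p in permutations(rows)}

print("constant key:", sorted(buggy_results))
print("real id key: ", sorted(fixed_results))
```

With a constant key the answer changes with arrival order, while keying on the real id always selects the row with the largest id, which matches the "last index value" behaviour described in the code comment.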