This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 8482ec9e5d8 [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
8482ec9e5d8 is described below

commit 8482ec9e5d832f89fa55d29cdde0f8005a062f17
Author: itholic <haejoon....@databricks.com>
AuthorDate: Mon Sep 5 14:56:37 2022 +0800

    [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection

    ### What changes were proposed in this pull request?

    This PR fixes an inconsistency in `Index.intersection`: when `other` is a list of tuples, the behavior of pandas API on Spark differs from pandas. pandas API on Spark silently returns an empty `MultiIndex`, whereas pandas raises a `ValueError`.

    - pandas API on Spark
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    MultiIndex([], )
    ```

    - pandas
    ```python
    >>> pidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
    Traceback (most recent call last):
    ...
    ValueError: Names should be list-like for a MultiIndex
    ```

    ### Why are the changes needed?

    To reach parity with pandas.

    ### Does this PR introduce _any_ user-facing change?

    Yes, the behavior of `Index.intersection` changes when `other` is a list of tuples:

    - Before
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    MultiIndex([], )
    ```

    - After
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    Traceback (most recent call last):
    ...
    ValueError: Names should be list-like for a MultiIndex
    ```

    ### How was this patch tested?

    Added a unit test.

    Closes #37739 from itholic/SPARK-40265.
    Authored-by: itholic <haejoon....@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/pandas/indexes/base.py            | 2 +-
 python/pyspark/pandas/tests/indexes/test_base.py | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/indexes/base.py b/python/pyspark/pandas/indexes/base.py
index facedb1dc91..5043325ccbb 100644
--- a/python/pyspark/pandas/indexes/base.py
+++ b/python/pyspark/pandas/indexes/base.py
@@ -2509,7 +2509,7 @@ class Index(IndexOpsMixin):
         elif is_list_like(other):
             other_idx = Index(other)
             if isinstance(other_idx, MultiIndex):
-                return other_idx.to_frame().head(0).index
+                raise ValueError("Names should be list-like for a MultiIndex")
             spark_frame_other = other_idx.to_frame()._to_spark()
             keep_name = True
         else:
diff --git a/python/pyspark/pandas/tests/indexes/test_base.py b/python/pyspark/pandas/tests/indexes/test_base.py
index 958314c5741..169a22571ec 100644
--- a/python/pyspark/pandas/tests/indexes/test_base.py
+++ b/python/pyspark/pandas/tests/indexes/test_base.py
@@ -1977,6 +1977,9 @@ class IndexesTest(ComparisonTestBase, TestUtils):
             psidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
         with self.assertRaisesRegex(ValueError, "Index data must be 1-dimensional"):
             psmidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
+        # other = list of tuple
+        with self.assertRaisesRegex(ValueError, "Names should be list-like for a MultiIndex"):
+            psidx.intersection([(1, 2), (3, 4)])
 
     def test_item(self):
         pidx = pd.Index([10])
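
For readers without a Spark session at hand, the control flow of the patched branch can be approximated in plain Python. This is only a rough sketch, not the pyspark implementation: `looks_like_multiindex` and `intersection_sketch` are hypothetical names, and the "all elements are tuples" test is an assumed stand-in for the real check, which builds `Index(other)` and tests `isinstance(other_idx, MultiIndex)`.

```python
from typing import Any, Iterable, List


def looks_like_multiindex(items: List[Any]) -> bool:
    """Hypothetical stand-in for `isinstance(Index(other), MultiIndex)`.

    Assumption: a non-empty list whose elements are all tuples is what
    would be inferred as a MultiIndex.
    """
    return len(items) > 0 and all(isinstance(item, tuple) for item in items)


def intersection_sketch(values: List[int], other: Iterable[Any]) -> List[int]:
    """Toy model of the list-like branch for a flat (single-level) index."""
    other_items = list(other)
    if looks_like_multiindex(other_items):
        # Patched behavior: raise instead of returning an empty MultiIndex.
        raise ValueError("Names should be list-like for a MultiIndex")
    # Otherwise return the sorted, de-duplicated common elements.
    return sorted(set(values) & set(other_items))


# A flat list intersects as usual.
print(intersection_sketch([1, 2, 3, 4], [3, 4, 5]))  # [3, 4]

# A list of tuples now raises, mirroring the new unit test.
try:
    intersection_sketch([1, 2, 3, 4], [(1, 2), (3, 4)])
except ValueError as exc:
    print(exc)  # Names should be list-like for a MultiIndex
```

The sketch only mirrors the decision the diff changes (raise rather than return an empty result). It does not model index names, dtypes, or the Spark-side join that the real method performs.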