This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 3148511a923 [SPARK-43123][PS] Raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype
3148511a923 is described below

commit 3148511a923bf59ea37d8f44e7427cde66f9f167
Author: Haejoon Lee <haejoon....@databricks.com>
AuthorDate: Tue Sep 12 14:36:42 2023 +0800

    [SPARK-43123][PS] Raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to raise a `TypeError` for `DataFrame.interpolate` when all columns are object-dtype.
    
    ### Why are the changes needed?
    
    To match the behavior of Pandas:
    ```python
    >>> pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate()
    ...
    TypeError: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype.
    ```
    We currently return an empty DataFrame instead of raising a `TypeError`:
    ```python
    >>> ps.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate()
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Computing `DataFrame.interpolate` on a DataFrame whose columns are all object-dtype will raise a `TypeError` instead of returning an empty DataFrame.
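    For illustration, the target behavior can be reproduced standalone with plain pandas (the fixed pandas-on-Spark API mirrors it; the variable names here are arbitrary):

    ```python
    import pandas as pd

    # An all-object-dtype frame: pandas raises TypeError on interpolate(),
    # which is the behavior this patch brings to pandas-on-Spark.
    df = pd.DataFrame({"A": ["a", "b", "c"], "B": ["a", "b", "c"]})
    try:
        df.interpolate()
    except TypeError as err:
        print(f"TypeError: {err}")
    ```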
    
    ### How was this patch tested?
    
    Added UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #42878 from itholic/SPARK-45123.
    
    Authored-by: Haejoon Lee <haejoon....@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/pandas/frame.py                        | 5 +++++
 python/pyspark/pandas/tests/test_frame_interpolate.py | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index adbef607256..3aebbd65427 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -6097,6 +6097,11 @@ defaultdict(<class 'list'>, {'col..., 'col...})]
             if isinstance(psser.spark.data_type, (NumericType, BooleanType)):
                 numeric_col_names.append(psser.name)
 
+        if len(numeric_col_names) == 0:
+            raise TypeError(
+                "Cannot interpolate with all object-dtype columns in the DataFrame. "
+                "Try setting at least one column to a numeric dtype."
+            )
         psdf = self[numeric_col_names]
         return psdf._apply_series_op(
             lambda psser: psser._interpolate(
diff --git a/python/pyspark/pandas/tests/test_frame_interpolate.py b/python/pyspark/pandas/tests/test_frame_interpolate.py
index 5b5856f7ab8..17c73781f8e 100644
--- a/python/pyspark/pandas/tests/test_frame_interpolate.py
+++ b/python/pyspark/pandas/tests/test_frame_interpolate.py
@@ -53,6 +53,11 @@ class FrameInterpolateTestsMixin:
         with self.assertRaisesRegex(ValueError, "invalid limit_area"):
             psdf.id.interpolate(limit_area="jump")
 
+        with self.assertRaisesRegex(
+            TypeError, "Cannot interpolate with all object-dtype columns in the DataFrame."
+        ):
+            ps.DataFrame({"A": ["a", "b", "c"], "B": ["a", "b", "c"]}).interpolate()
+
     def _test_interpolate(self, pobj):
         psobj = ps.from_pandas(pobj)
         self.assert_eq(psobj.interpolate(), pobj.interpolate())


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
