[GitHub] [spark] xinrong-databricks commented on a change in pull request #33506: [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex

2021-07-26 Thread GitBox


xinrong-databricks commented on a change in pull request #33506:
URL: https://github.com/apache/spark/pull/33506#discussion_r676986923



##########
File path: python/pyspark/pandas/categorical.py
##########
@@ -680,12 +681,152 @@ def reorder_categories(
 
     def set_categories(
         self,
-        new_categories: pd.Index,
-        ordered: bool = None,
+        new_categories: Union[pd.Index, List],
+        ordered: Optional[bool] = None,
         rename: bool = False,
         inplace: bool = False,
-    ) -> "ps.Series":
-        raise NotImplementedError()
+    ) -> Optional["ps.Series"]:
+"""
+Set the categories to the specified new_categories.
+
+`new_categories` can include new categories (which will result in
+unused categories) or remove old categories (which results in values
+set to NaN). If `rename==True`, the categories will simple be renamed
+(less or more items than in old categories will result in values set to
+NaN or in unused categories respectively).
+
+This method can be used to perform more than one action of adding,
+removing, and reordering simultaneously and is therefore faster than
+performing the individual steps via the more specialised methods.
+
+On the other hand this methods does not do checks (e.g., whether the
+old categories are included in the new categories on a reorder), which
+can result in surprising changes, for example when using special string
+dtypes, which does not considers a S1 string equal to a single char
+python string.
+
+Parameters
+--
+new_categories : Index-like
+   The categories in new order.
+ordered : bool, default False
+   Whether or not the categorical is treated as a ordered categorical.
+   If not given, do not change the ordered information.
+rename : bool, default False
+   Whether or not the new_categories should be considered as a rename
+   of the old categories or as reordered categories.
+inplace : bool, default False
+   Whether or not to reorder the categories in-place or return a copy
+   of this categorical with reordered categories.
+
+Returns
+---
+Series with reordered categories or None if inplace.
+
+Raises
+--
+ValueError
+If new_categories does not validate as categories
+
+See Also
+
+rename_categories : Rename categories.
+reorder_categories : Reorder categories.
+add_categories : Add new categories.
+remove_categories : Remove the specified categories.
+remove_unused_categories : Remove categories which are not used.
+
+Examples
+
+        >>> s = ps.Series(list("abbccc"), dtype="category")
+        >>> s  # doctest: +SKIP
+        0    a
+        1    b
+        2    b
+        3    c
+        4    c
+        5    c
+        dtype: category
+        Categories (3, object): ['a', 'b', 'c']
+
+        >>> s.cat.set_categories(['b', 'c'])  # doctest: +SKIP
+        0    NaN
+        1      b
+        2      b
+        3      c
+        4      c
+        5      c
+        dtype: category
+        Categories (2, object): ['b', 'c']
+
+        >>> s.cat.set_categories([1, 2, 3], rename=True)  # doctest: +SKIP
+        0    1
+        1    2
+        2    2
+        3    3
+        4    3
+        5    3
+        dtype: category
+        Categories (3, int64): [1, 2, 3]
+
+        >>> s.cat.set_categories([1, 2, 3], rename=True, ordered=True)  # doctest: +SKIP
+        0    1
+        1    2
+        2    2
+        3    3
+        4    3
+        5    3
+        dtype: category
+        Categories (3, int64): [1 < 2 < 3]
+        """
+        from pyspark.pandas.frame import DataFrame
+
+        if not is_list_like(new_categories):
+            raise TypeError(
+                "Parameter 'new_categories' must be list-like, was '{}'".format(new_categories)
+            )
+
+        if ordered is None:
+            ordered = self.ordered
+
+        new_dtype = CategoricalDtype(new_categories, ordered=ordered)
+        scol = self._data.spark.column
+
+        if rename:
+            new_scol = (
+                F.when(scol >= len(new_categories), -1)
+                .otherwise(scol)
+                .alias(self._data._internal.data_spark_column_names[0])
+            )
+
+            internal = self._data._psdf._internal.with_new_spark_column(
+                self._data._column_label,
+                new_scol,
+                field=self._data._internal.data_fields[0].copy(
+                    dtype=new_dtype, spark_type=IntegerType()
Review comment:
   cast to the original spark type.
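   For illustration, a runnable sketch of what the suggestion could look like; this is not the PR's code, and the toy DataFrame, the column name `code`, and the choice of `ByteType` as the "original" type are assumptions for the example:

   ```python
   from pyspark.sql import SparkSession, functions as F
   from pyspark.sql.types import ByteType

   spark = SparkSession.builder.getOrCreate()

   # Toy stand-in for the internal category-code column; "code" is hypothetical.
   df = spark.createDataFrame([(0,), (1,), (2,), (3,)], "code int")

   n_new = 2  # length of a hypothetical new_categories list
   remapped = (
       F.when(F.col("code") >= n_new, -1)  # codes beyond the new categories become -1
       .otherwise(F.col("code"))
       .cast(ByteType())                   # the review's point: cast back to the
       .alias("code")                      # original field's type, not IntegerType()
   )
   df.select(remapped).show()
   ```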





[GitHub] [spark] xinrong-databricks commented on a change in pull request #33506: [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex

2021-07-26 Thread GitBox


xinrong-databricks commented on a change in pull request #33506:
URL: https://github.com/apache/spark/pull/33506#discussion_r676860245



##########
File path: python/pyspark/pandas/categorical.py
##########
@@ -593,12 +595,107 @@ def reorder_categories(
 
     def set_categories(
         self,
-        new_categories: pd.Index,
+        new_categories: Union[pd.Index, List],
         ordered: bool = None,
         rename: bool = False,
         inplace: bool = False,
-    ) -> "ps.Series":
-        raise NotImplementedError()
+    ) -> Optional["ps.Series"]:
+"""
+Set the categories to the specified new_categories.
+
+`new_categories` can include new categories (which will result in
+unused categories) or remove old categories (which results in values
+set to NaN). If `rename==True`, the categories will simple be renamed
+(less or more items than in old categories will result in values set to
+NaN or in unused categories respectively).
+
+This method can be used to perform more than one action of adding,
+removing, and reordering simultaneously and is therefore faster than
+performing the individual steps via the more specialised methods.
+
+On the other hand this methods does not do checks (e.g., whether the
+old categories are included in the new categories on a reorder), which
+can result in surprising changes, for example when using special string
+dtypes, which does not considers a S1 string equal to a single char
+python string.
+
+Parameters
+--
+new_categories : Index-like
+   The categories in new order.
+ordered : bool, default False
+   Whether or not the categorical is treated as a ordered categorical.
+   If not given, do not change the ordered information.
+rename : bool, default False
+   Whether or not the new_categories should be considered as a rename
+   of the old categories or as reordered categories.
+inplace : bool, default False
+   Whether or not to reorder the categories in-place or return a copy
+   of this categorical with reordered categories.
+
+Returns
+---
+Series with reordered categories or None if inplace.
+
+Raises
+--
+ValueError
+If new_categories does not validate as categories
+
+See Also
+
+rename_categories : Rename categories.
+reorder_categories : Reorder categories.
+add_categories : Add new categories.
+remove_categories : Remove the specified categories.
+remove_unused_categories : Remove categories which are not used.
+"""
+        from pyspark.pandas.frame import DataFrame
+
+        if not is_list_like(new_categories):
+            raise TypeError(
+                "Parameter 'new_categories' must be list-like, was '{}'".format(new_categories)
+            )
+
+        if ordered is None:
+            ordered = self.ordered
+
+        new_dtype = CategoricalDtype(new_categories, ordered=ordered)
+        scol = self._data.spark.column
+
+        if rename:
+            new_scol = (
+                F.when(scol >= len(new_categories), -1)
+                .otherwise(scol)
+                .alias(name_like_string(self._data._column_label))

Review comment:
   Makes sense, updated.
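   For readers following along: pandas-on-Spark stores a categorical column as its integer codes, which is why the hunk can compare `scol` against `len(new_categories)` directly. A plain-pandas sketch of that code representation (no Spark required; the series content is made up):

   ```python
   import pandas as pd

   # A categorical is its integer codes plus a categories index; the Spark
   # column manipulated above holds exactly these codes.
   s = pd.Series(list("abbccc"), dtype="category")
   print(s.cat.codes.tolist())    # [0, 1, 1, 2, 2, 2]
   print(list(s.cat.categories))  # ['a', 'b', 'c']

   # Under rename, the codes stay put and only the categories are swapped, so
   # any code >= len(new_categories) no longer maps to a category (hence -1).
   new_categories = ["x", "y"]
   print([c if c < len(new_categories) else -1 for c in s.cat.codes])
   # [0, 1, 1, -1, -1, -1]
   ```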

[GitHub] [spark] xinrong-databricks commented on a change in pull request #33506: [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex

2021-07-26 Thread GitBox


xinrong-databricks commented on a change in pull request #33506:
URL: https://github.com/apache/spark/pull/33506#discussion_r676858391



##########
File path: python/pyspark/pandas/data_type_ops/categorical_ops.py
##########
@@ -42,6 +42,16 @@ def pretty_name(self) -> str:
 
     def restore(self, col: pd.Series) -> pd.Series:
         """Restore column when to_pandas."""
+        try:
+            pd.Categorical.from_codes(
+                col.replace(np.nan, -1).astype(int),
+                categories=cast(CategoricalDtype, self.dtype).categories,
+                ordered=cast(CategoricalDtype, self.dtype).ordered,
+            )
+        except:
+            print(col)
+            print(self.dtype.categories)
+

Review comment:
   Oh thanks, removed.
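   The `from_codes` call being cleaned up above is the restore path from Spark codes back to a pandas Categorical; a self-contained sketch of what it does (the code values and categories here are made up):

   ```python
   import numpy as np
   import pandas as pd

   # Codes can come back as floats when NaN is present; restore maps NaN to -1,
   # which Categorical.from_codes treats as a missing value.
   codes = pd.Series([0.0, 1.0, np.nan, 2.0])
   restored = pd.Categorical.from_codes(
       codes.replace(np.nan, -1).astype(int),
       categories=["a", "b", "c"],
       ordered=False,
   )
   print(pd.Series(restored))  # a, b, NaN, c
   ```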

##########
File path: python/pyspark/pandas/categorical.py
##########
@@ -593,12 +595,107 @@ def reorder_categories(
 
     def set_categories(
         self,
-        new_categories: pd.Index,
+        new_categories: Union[pd.Index, List],
         ordered: bool = None,

Review comment:
   Modified.
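   For context on the annotation change: under PEP 484 a parameter defaulting to None should be typed `Optional[...]`. A minimal sketch (the helper name and fallback constant are hypothetical):

   ```python
   from typing import Optional

   DEFAULT_ORDERED = False  # hypothetical fallback, standing in for self.ordered

   # `ordered: bool = None` trips strict type checkers (mypy flags it unless
   # implicit Optional is enabled); Optional[bool] states the intent explicitly.
   def resolve_ordered(ordered: Optional[bool] = None) -> bool:
       return DEFAULT_ORDERED if ordered is None else ordered

   print(resolve_ordered())      # False
   print(resolve_ordered(True))  # True
   ```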




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org