This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a7f8ccef122a [SPARK-47891][PYTHON][DOCS] Improve docstring of mapInPandas
a7f8ccef122a is described below

commit a7f8ccef122a629559bae91e3847589c4cf1a46a
Author: Xinrong Meng <xinr...@apache.org>
AuthorDate: Thu Apr 18 09:47:47 2024 +0900

    [SPARK-47891][PYTHON][DOCS] Improve docstring of mapInPandas

    ### What changes were proposed in this pull request?
    Improve docstring of mapInPandas

    - "using a Python native function that takes and outputs a pandas DataFrame" is confusing cause the function takes and outputs "ITERATOR of pandas DataFrames" instead.
    - "All columns are passed together as an iterator of pandas DataFrames" easily mislead users to think the entire DataFrame will be passed together, "a batch of rows" is used instead.

    ### Why are the changes needed?
    More accurate and clear docstring.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Doc change only.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #46108 from xinrong-meng/doc_mapInPandas.

    Authored-by: Xinrong Meng <xinr...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/sql/pandas/map_ops.py | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/python/pyspark/sql/pandas/map_ops.py b/python/pyspark/sql/pandas/map_ops.py
index 82bcd58b0c0e..6d8bb7c779b7 100644
--- a/python/pyspark/sql/pandas/map_ops.py
+++ b/python/pyspark/sql/pandas/map_ops.py
@@ -30,7 +30,7 @@ if TYPE_CHECKING:
 
 class PandasMapOpsMixin:
     """
-    Min-in for pandas map operations. Currently, only :class:`DataFrame`
+    Mix-in for pandas map operations. Currently, only :class:`DataFrame`
     can use this class.
     """
 
@@ -43,16 +43,14 @@ class PandasMapOpsMixin:
     ) -> "DataFrame":
         """
         Maps an iterator of batches in the current :class:`DataFrame` using a Python native
-        function that takes and outputs a pandas DataFrame, and returns the result as a
-        :class:`DataFrame`.
+        function that is performed on pandas DataFrames both as input and output,
+        and returns the result as a :class:`DataFrame`.
 
-        The function should take an iterator of `pandas.DataFrame`\\s and return
-        another iterator of `pandas.DataFrame`\\s. All columns are passed
-        together as an iterator of `pandas.DataFrame`\\s to the function and the
-        returned iterator of `pandas.DataFrame`\\s are combined as a :class:`DataFrame`.
-        Each `pandas.DataFrame` size can be controlled by
-        `spark.sql.execution.arrow.maxRecordsPerBatch`. The size of the function's input and
-        output can be different.
+        This method applies the specified Python function to an iterator of
+        `pandas.DataFrame`\\s, each representing a batch of rows from the original DataFrame.
+        The returned iterator of `pandas.DataFrame`\\s are combined as a :class:`DataFrame`.
+        The size of the function's input and output can be different. Each `pandas.DataFrame`
+        size can be controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`.
 
         .. versionadded:: 3.0.0
 
@@ -68,7 +66,8 @@ class PandasMapOpsMixin:
             the return type of the `func` in PySpark. The value can be either a
             :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
         barrier : bool, optional, default False
-            Use barrier mode execution.
+            Use barrier mode execution, ensuring that all Python workers in the stage will be
+            launched concurrently.
 
             .. versionadded: 3.5.0
 

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org