HyukjinKwon commented on code in PR #39628:
URL: https://github.com/apache/spark/pull/39628#discussion_r1082211081
##########
python/pyspark/ml/functions.py:
##########

@@ -647,37 +386,369 @@ def predict_columnar(x1: np.ndarray, x2: np.ndarray) -> Mapping[str, np.ndarray]
         Function which is responsible for loading a model and returning a
         :py:class:`PredictBatchFunction` which takes one or more numpy arrays as input and
         returns one of the following:
-        - a numpy array (for a single output)
-        - a dictionary of named numpy arrays (for multiple outputs)
-        - a row-oriented list of dictionaries (for multiple outputs).
+
+        * a numpy array (for a single output)
+        * a dictionary of named numpy arrays (for multiple outputs)
+        * a row-oriented list of dictionaries (for multiple outputs).
+
         For a dictionary of named numpy arrays, the arrays can only be one or two dimensional, since
-        higher dimension arrays are not supported. For a row-oriented list of dictionaries, each
+        higher dimensional arrays are not supported. For a row-oriented list of dictionaries, each
         element in the dictionary must be either a scalar or one-dimensional array.
-    return_type : :class:`pspark.sql.types.DataType` or str.
+    return_type : :py:class:`pyspark.sql.types.DataType` or str.
         Spark SQL datatype for the expected output:
-        - Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
-        - ArrayType --> 2-dim numpy array.
-        - StructType --> dict with keys matching struct fields.
-        - StructType --> list of dict with keys matching struct fields, for models like the
-          [Huggingface pipeline for sentiment analysis](https://huggingface.co/docs/transformers/quicktour#pipeline-usage]  # noqa: E501
+
+        * Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
+        * ArrayType --> 2-dim numpy array.
+        * StructType --> dict with keys matching struct fields.
+        * StructType --> list of dict with keys matching struct fields, for models like the
+          `Huggingface pipeline for sentiment analysis
+          <https://huggingface.co/docs/transformers/quicktour#pipeline-usage>`_.
+
     batch_size : int
-        Batch size to use for inference, note that this is typically a limitation of the model
-        and/or the hardware resources and is usually smaller than the Spark partition size.
-    input_tensor_shapes: List[List[int] | None] | Mapping[int, List[int]] | None
-        Optional input tensor shapes for models with tensor inputs. This can be a list of shapes,
+        Batch size to use for inference. This is typically a limitation of the model
+        and/or available hardware resources and is usually smaller than the Spark partition size.
+    input_tensor_shapes : List[List[int] | None] | Mapping[int, List[int]] | None, optional
+        Input tensor shapes for models with tensor inputs. This can be a list of shapes,
         where each shape is a list of integers or None (for scalar inputs). Alternatively, this
         can be represented by a "sparse" dictionary, where the keys are the integer indices of
         the inputs, and the values are the shapes. Each tensor input value in the Spark
         DataFrame must be represented as a single column containing a flattened 1-D array. The provided
-        input_tensor_shapes will be used to reshape the flattened array into expected tensor shape.
-        For the list form, the order of the tensor shapes must match the order of the selected
-        DataFrame columns. The batch dimension (typically -1 or None in the first dimension) should
-        not be included, since it will be determined by the batch_size argument. Tabular datasets
-        with scalar-valued columns should not provide this argument.
+        `input_tensor_shapes` will be used to reshape the flattened array into the expected tensor
+        shape. For the list form, the order of the tensor shapes must match the order of the
+        selected DataFrame columns. The batch dimension (typically -1 or None in the first
+        dimension) should not be included, since it will be determined by the batch_size argument.
+        Tabular datasets with scalar-valued columns should not provide this argument.

     Returns
     -------
-    A pandas_udf for predicting a batch.
+    :py:class:`UserDefinedFunctionLike`
+        A Pandas UDF for model inference on a Spark DataFrame.
+
+    Examples
+    --------
+    For a pre-trained TensorFlow MNIST model with two-dimensional input images represented as a
+    flattened tensor value stored in a single Spark DataFrame column of type `array<float>`.::
+
+        from pyspark.ml.functions import predict_batch_udf

Review Comment:
   If this example cannot be run, it would be better to use `.. code-block:: python`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
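As a side note on the docstring under review: the dictionary-of-named-arrays output form (which pairs with a `StructType` return_type whose field names match the dict keys) can be sketched in plain numpy. This is only an illustration of the expected shape contract; the function body and the field names `mean` and `total` are invented here, not part of the PR.

```python
import numpy as np

# Sketch of a columnar predict function of the kind make_predict_fn returns:
# it takes a batch of inputs as a numpy array and returns a dictionary of
# named 1-D numpy arrays, one entry per row of the batch.
def predict(inputs: np.ndarray) -> dict:
    return {
        "mean": inputs.mean(axis=1),   # one value per row -> 1-D array
        "total": inputs.sum(axis=1),   # one value per row -> 1-D array
    }

batch = np.arange(12, dtype=np.float64).reshape(3, 4)  # batch of 3 rows
out = predict(batch)
print(out["mean"])   # per-row means, shape (3,)
print(out["total"])  # per-row sums, shape (3,)
```

Each value in the returned dict is one-dimensional, matching the docstring's note that arrays in this form can only be one or two dimensional.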
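The `input_tensor_shapes` reshaping described in the docstring can also be sketched with numpy alone (no Spark needed): a column of flattened 1-D arrays is restored to its tensor shape, with the batch dimension inferred rather than listed in the shape. The 28x28 shape below is just the MNIST example from the docstring, used for illustration.

```python
import numpy as np

# Per-row tensor shape, WITHOUT the batch dimension, as the docstring requires.
tensor_shape = [28, 28]

# A batch of 4 rows, each holding a flattened 784-element array, mimicking a
# single Spark DataFrame column of type array<float>.
batch = np.zeros((4, 28 * 28), dtype=np.float32)

# -1 lets numpy infer the batch dimension, mirroring how the batch size is
# determined by the batch_size argument rather than by the shape entry.
reshaped = batch.reshape([-1] + tensor_shape)
print(reshaped.shape)  # (4, 28, 28)
```

This is why the docstring warns not to include the batch dimension in the shape list: it is supplied implicitly by however many rows arrive in each batch.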