HyukjinKwon commented on code in PR #39628:
URL: https://github.com/apache/spark/pull/39628#discussion_r1082211081
##########
python/pyspark/ml/functions.py:
##########

@@ -647,37 +386,369 @@ def predict_columnar(x1: np.ndarray, x2: np.ndarray) -> Mapping[str, np.ndarray]
         Function which is responsible for loading a model and returning a
         :py:class:`PredictBatchFunction` which takes one or more numpy arrays as input and
         returns one of the following:
-        - a numpy array (for a single output)
-        - a dictionary of named numpy arrays (for multiple outputs)
-        - a row-oriented list of dictionaries (for multiple outputs).
+
+        * a numpy array (for a single output)
+        * a dictionary of named numpy arrays (for multiple outputs)
+        * a row-oriented list of dictionaries (for multiple outputs).
+
         For a dictionary of named numpy arrays, the arrays can only be one or two dimensional, since
-        higher dimension arrays are not supported. For a row-oriented list of dictionaries, each
+        higher dimensional arrays are not supported. For a row-oriented list of dictionaries, each
         element in the dictionary must be either a scalar or one-dimensional array.
-    return_type : :class:`pspark.sql.types.DataType` or str.
+    return_type : :py:class:`pyspark.sql.types.DataType` or str.
         Spark SQL datatype for the expected output:
-        - Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
-        - ArrayType --> 2-dim numpy array.
-        - StructType --> dict with keys matching struct fields.
-        - StructType --> list of dict with keys matching struct fields, for models like the
-          [Huggingface pipeline for sentiment analysis](https://huggingface.co/docs/transformers/quicktour#pipeline-usage]  # noqa: E501
+
+        * Scalar (e.g. IntegerType, FloatType) --> 1-dim numpy array.
+        * ArrayType --> 2-dim numpy array.
+        * StructType --> dict with keys matching struct fields.
+        * StructType --> list of dict with keys matching struct fields, for models like the
+          `Huggingface pipeline for sentiment analysis
+          <https://huggingface.co/docs/transformers/quicktour#pipeline-usage>`_.
+
     batch_size : int
-        Batch size to use for inference, note that this is typically a limitation of the model
-        and/or the hardware resources and is usually smaller than the Spark partition size.
-    input_tensor_shapes: List[List[int] | None] | Mapping[int, List[int]] | None
-        Optional input tensor shapes for models with tensor inputs. This can be a list of shapes,
+        Batch size to use for inference. This is typically a limitation of the model
+        and/or available hardware resources and is usually smaller than the Spark partition size.
+    input_tensor_shapes : List[List[int] | None] | Mapping[int, List[int]] | None, optional
+        Input tensor shapes for models with tensor inputs. This can be a list of shapes,
         where each shape is a list of integers or None (for scalar inputs). Alternatively, this
         can be represented by a "sparse" dictionary, where the keys are the integer indices of
         the inputs, and the values are the shapes. Each tensor input value in the Spark
         DataFrame must be represented as a single column containing a flattened 1-D array. The provided
-        input_tensor_shapes will be used to reshape the flattened array into expected tensor shape.
-        For the list form, the order of the tensor shapes must match the order of the selected
-        DataFrame columns. The batch dimension (typically -1 or None in the first dimension) should
-        not be included, since it will be determined by the batch_size argument. Tabular datasets
-        with scalar-valued columns should not provide this argument.
+        `input_tensor_shapes` will be used to reshape the flattened array into the expected tensor
+        shape. For the list form, the order of the tensor shapes must match the order of the
+        selected DataFrame columns. The batch dimension (typically -1 or None in the first
+        dimension) should not be included, since it will be determined by the batch_size argument.
+        Tabular datasets with scalar-valued columns should not provide this argument.

     Returns
     -------
-    A pandas_udf for predicting a batch.
+    :py:class:`UserDefinedFunctionLike`
+        A Pandas UDF for model inference on a Spark DataFrame.
+
+    Examples
+    --------
+    For a pre-trained TensorFlow MNIST model with two-dimensional input images represented as a
+    flattened tensor value stored in a single Spark DataFrame column of type `array<float>`.::
+
+        from pyspark.ml.functions import predict_batch_udf

Review Comment:
   If this example cannot be run, it would be better to use `.. code-block:: python`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
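As a side note on the docstring under review: the dictionary-of-named-arrays output form (which pairs with a `StructType` return_type whose field names match the dict keys) can be sketched in plain numpy. This is only an illustration of the expected shape contract; the function body and the field names `mean` and `total` are invented here, not part of the PR.

```python
import numpy as np

# Sketch of a columnar predict function of the kind make_predict_fn returns:
# it takes a batch of inputs as a numpy array and returns a dictionary of
# named 1-D numpy arrays, one entry per row of the batch.
def predict(inputs: np.ndarray) -> dict:
    return {
        "mean": inputs.mean(axis=1),   # one value per row -> 1-D array
        "total": inputs.sum(axis=1),   # one value per row -> 1-D array
    }

batch = np.arange(12, dtype=np.float64).reshape(3, 4)  # batch of 3 rows
out = predict(batch)
print(out["mean"])   # per-row means, shape (3,)
print(out["total"])  # per-row sums, shape (3,)
```

Each value in the returned dict is one-dimensional, matching the docstring's note that arrays in this form can only be one or two dimensional.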
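The `input_tensor_shapes` reshaping described in the docstring can also be sketched with numpy alone (no Spark needed): a column of flattened 1-D arrays is restored to its tensor shape, with the batch dimension inferred rather than listed in the shape. The 28x28 shape below is just the MNIST example from the docstring, used for illustration.

```python
import numpy as np

# Per-row tensor shape, WITHOUT the batch dimension, as the docstring requires.
tensor_shape = [28, 28]

# A batch of 4 rows, each holding a flattened 784-element array, mimicking a
# single Spark DataFrame column of type array<float>.
batch = np.zeros((4, 28 * 28), dtype=np.float32)

# -1 lets numpy infer the batch dimension, mirroring how the batch size is
# determined by the batch_size argument rather than by the shape entry.
reshaped = batch.reshape([-1] + tensor_shape)
print(reshaped.shape)  # (4, 28, 28)
```

This is why the docstring warns not to include the batch dimension in the shape list: it is supplied implicitly by however many rows arrive in each batch.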