Weichen Xu created SPARK-43289:
----------------------------------

             Summary: PySpark UDF supports python package dependencies
                 Key: SPARK-43289
                 URL: https://issues.apache.org/jira/browse/SPARK-43289
             Project: Spark
          Issue Type: New Feature
          Components: Connect, ML, PySpark
    Affects Versions: 3.5.0
            Reporter: Weichen Xu


h3. Requirements

Make PySpark UDFs support annotating Python package dependencies, so that when a UDF executes, the UDF worker creates a new Python environment with the provided dependencies and runs the UDF inside it.
h3. Motivation

We have two major cases:

 * For the Spark Connect case, the client Python environment is very likely to differ from the PySpark server-side Python environment, which causes the user's UDF to fail on the server side.
 * Some third-party machine learning libraries (e.g. MLflow) require PySpark UDFs to support dependencies, because in ML cases model inference must run in exactly the same Python environment that trained the model. MLflow currently supports this by creating a child Python process inside the PySpark UDF worker and redirecting all UDF input data to that child process for inference, which incurs significant overhead. If PySpark UDFs supported built-in Python dependency management, this poorly performing workaround would no longer be needed.

h3. Proposed API

```
@pandas_udf("string", pip_requirements=...)
```

The `pip_requirements` argument is either an iterable of pip requirement strings (e.g. `["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]`) or the string path to a pip requirements file on the local filesystem (e.g. `"/path/to/requirements.txt"`). It specifies the pip requirements for the Python UDF.
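
A minimal sketch of how the proposed API might be used, covering both forms of `pip_requirements`. Note that `pip_requirements` is the parameter proposed above and does not exist in current PySpark releases; the package versions and UDF bodies below are illustrative assumptions.

```
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Inline requirement strings: the UDF worker would create a fresh
# Python environment containing these packages before executing the
# UDF. (pip_requirements is the proposed parameter; versions are
# illustrative.)
@pandas_udf("string", pip_requirements=["scikit-learn==1.2.2", "numpy"])
def classify(s: pd.Series) -> pd.Series:
    import sklearn  # resolved from the per-UDF environment
    return s.map(lambda v: f"sklearn {sklearn.__version__}: {v}")

# Equivalently, pass the path of a pip requirements file on the local
# filesystem.
@pandas_udf("string", pip_requirements="/path/to/requirements.txt")
def classify_from_file(s: pd.Series) -> pd.Series:
    return s
```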


