Weichen Xu created SPARK-43289:
----------------------------------

             Summary: PySpark UDF supports python package dependencies
                 Key: SPARK-43289
                 URL: https://issues.apache.org/jira/browse/SPARK-43289
             Project: Spark
          Issue Type: New Feature
          Components: Connect, ML, PySpark
    Affects Versions: 3.5.0
            Reporter: Weichen Xu
h3. Requirements

Make pyspark UDFs support annotating python package dependencies, so that when the UDF is executed, the UDF worker creates a new python environment with the provided dependencies.

h3. Motivation

We have two major cases:

* For the Spark Connect case, the client python environment is very likely to differ from the pyspark server-side python environment, which causes the user's UDF to fail on the server side.
* Some third-party machine learning libraries (e.g. MLflow) require pyspark UDFs to support dependencies, because in ML cases we need to run model inference via a pyspark UDF in exactly the same python environment that trained the model. Currently MLflow supports this by creating a child python process in the pyspark UDF worker and redirecting all UDF input data to that child process for inference, which causes significant overhead. If pyspark UDFs supported built-in python dependency management, this poorly performing approach would no longer be needed.

h3. Proposed API

```
@pandas_udf("string", pip_requirements=...)
```

The `pip_requirements` argument is either an iterable of pip requirement strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]``) or the string path to a pip requirements file on the local filesystem (e.g. ``"/path/to/requirements.txt"``), and it specifies the pip requirements for the python UDF.
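To make the two accepted forms of `pip_requirements` concrete, here is a minimal sketch of how the argument could be normalized into a flat list of pip requirement strings before the UDF worker builds the environment. The helper name `normalize_pip_requirements` is hypothetical and not part of the proposal; it only illustrates the iterable-vs-path distinction described above.

```python
from typing import Iterable, List, Union


def normalize_pip_requirements(
    pip_requirements: Union[str, Iterable[str]]
) -> List[str]:
    """Normalize the proposed `pip_requirements` argument (hypothetical helper).

    - A plain string is treated as the path to a pip requirements file on the
      local filesystem and rewritten as the equivalent ``-r`` entry.
    - Any other iterable is taken as pip requirement strings verbatim,
      possibly including ``-r``/``-c`` entries.
    """
    if isinstance(pip_requirements, str):
        # String form: path to a requirements file, e.g. "/path/to/requirements.txt".
        return ["-r " + pip_requirements]
    # Iterable form: requirement strings passed through unchanged.
    return list(pip_requirements)


# Both forms reduce to a list the worker could hand to `pip install`:
reqs_from_file = normalize_pip_requirements("/path/to/requirements.txt")
reqs_from_list = normalize_pip_requirements(
    ["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]
)
```

The UDF worker would then create a fresh python environment and install these requirements into it before running the user's function.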