Xiangrui Meng created SPARK-26412:
-------------------------------------

             Summary: Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
                 Key: SPARK-26412
                 URL: https://issues.apache.org/jira/browse/SPARK-26412
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng

Pandas UDF is the ideal connection between PySpark and DL model inference workloads. However, users need to load the model file before they can make predictions, and models of ~100MB or larger are common. If Pandas UDF execution is limited to batch scope, users have to reload the same model for every batch in the same Python worker process, which is inefficient.

I created this JIRA to discuss possible solutions. Essentially, we need to support "start()" and "finish()" in addition to "apply". We can either provide those interfaces, or simply give users an iterator of batches as pd.DataFrames and let user code handle it.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
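
To make the iterator-of-batches option concrete, here is a minimal sketch in plain pandas (no Spark involved) of what user code might look like if it received an iterator of pd.DataFrames for a whole partition. Everything named here (DummyModel, load_model, predict_partition, the "x" and "prediction" columns) is a hypothetical stand-in for illustration, not an existing or proposed Spark API. The point is only that "start" and "finish" logic can live before and after the loop, so the model is loaded once per partition rather than once per batch.

```python
from typing import Iterator

import pandas as pd

LOAD_COUNT = 0  # tracks how many times the hypothetical model is loaded


class DummyModel:
    """Hypothetical stand-in for a large DL model (assumption, not a real API)."""

    def predict(self, batch: pd.DataFrame) -> pd.Series:
        # Toy "inference": double the input column.
        return batch["x"] * 2


def load_model() -> DummyModel:
    """Hypothetical expensive load step, e.g. reading a ~100MB model file."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return DummyModel()


def predict_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = load_model()  # "start": runs once for the whole partition
    try:
        for batch in batches:  # "apply": runs once per batch
            yield pd.DataFrame({"prediction": model.predict(batch)})
    finally:
        del model  # "finish": release resources once the partition is done


# Simulate one partition that arrives as two batches.
partition = iter([pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})])
results = list(predict_partition(partition))
```

With this shape, the two batches share a single load_model() call, whereas a batch-scoped function would have to call it twice.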