Xiangrui Meng created SPARK-26412:
-------------------------------------

             Summary: Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
                 Key: SPARK-26412
                 URL: https://issues.apache.org/jira/browse/SPARK-26412
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng


Pandas UDFs are a natural bridge between PySpark and deep learning (DL) model 
inference workloads. However, users must load the model file before they can 
make predictions, and models of ~100 MB or larger are common. If Pandas UDF 
execution is limited to batch scope, users must reload the same model for 
every batch in the same Python worker process, which is inefficient. I created 
this JIRA to discuss possible solutions.

Essentially, we need to support "start()" and "finish()" in addition to 
"apply". We can either provide those interfaces explicitly or simply give user 
code an iterator of batches as pd.DataFrames and let it handle setup and 
teardown itself.
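
To make the second option concrete, here is a minimal sketch in plain pandas 
(no Spark involved) of what user code could look like if it received an 
iterator of batches. Everything named here (DummyModel, load_model, 
predict_partition, the "x" and "prediction" columns) is hypothetical and only 
for illustration; it is not an existing or proposed Spark API.

```python
from typing import Iterator

import pandas as pd

LOAD_COUNT = 0  # counts how often the (hypothetical) model gets loaded


class DummyModel:
    """Stand-in for an expensive-to-load model; not a Spark API."""

    def predict(self, batch: pd.DataFrame) -> pd.Series:
        # Placeholder "inference": double the input column.
        return batch["x"] * 2


def load_model() -> DummyModel:
    """Pretend to read a large model file from disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return DummyModel()


def predict_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = load_model()        # plays the role of "start()": once per partition
    try:
        for batch in batches:   # plays the role of "apply": once per batch
            yield pd.DataFrame({"prediction": model.predict(batch)})
    finally:
        del model               # plays the role of "finish()": release resources


# Simulate one partition that arrives as two batches.
batches = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]
results = list(predict_partition(iter(batches)))
```

With this shape the model is loaded once for the whole partition rather than 
once per batch, and user code decides where setup and cleanup happen, so no 
separate "start()"/"finish()" hooks would be needed.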

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

