I guess one drawback of the pandas UDF approach is that each group is handed
to the function as a pandas DataFrame, and pandas holds its data in RAM, so
every group has to fit in memory. If you are going to run multiple parallel
jobs, then a single machine may not be viable?
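
For reference, here is a minimal sketch of the per-group pattern being
discussed. It is only an illustration under assumptions: the column names
("group", "x", "y"), the helper names and the choice of a straight-line fit
are all made up, and nothing here comes from Riccardo's actual pipeline. The
per-group function takes one pandas DataFrame and returns one row of fitted
parameters. It is run locally with plain pandas so it works without a cluster.

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit y = slope * x + intercept on the rows of a single group.

    Hypothetical columns: "group", "x", "y". The whole group arrives as one
    in-memory pandas DataFrame, which is the memory limit mentioned above.
    """
    slope, intercept = np.polyfit(
        pdf["x"].to_numpy(dtype=float), pdf["y"].to_numpy(dtype=float), deg=1
    )
    return pd.DataFrame(
        {
            "group": [pdf["group"].iloc[0]],
            "slope": [float(slope)],
            "intercept": [float(intercept)],
            "n_rows": [len(pdf)],
        }
    )


def fit_all_locally(pdf: pd.DataFrame) -> pd.DataFrame:
    """Apply fit_group to every group with plain pandas (no Spark needed)."""
    parts = [fit_group(group_pdf) for _, group_pdf in pdf.groupby("group")]
    return pd.concat(parts, ignore_index=True)


# Assumption: on Spark 3.0+ the same function could be handed to a grouped
# Spark DataFrame (here called sdf, hypothetical) roughly like this:
#   sdf.groupBy("group").applyInPandas(
#       fit_group,
#       schema="group string, slope double, intercept double, n_rows long",
#   )
```

The function returns only the fitted parameters rather than the model object,
since the result of each group has to come back as rows of a DataFrame.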




On Thu, 21 Jan 2021 at 16:29, Sean Owen <[email protected]> wrote:

> If you mean you want to train N models in parallel, you wouldn't be able
> to do that with a groupBy first. You apply logic to the result of groupBy
> with Spark, but can't use Spark within Spark. You can run N Spark jobs in
> parallel on the driver but you'd have to have each read the subset of data
> that it's meant to model separately.
>
> A pandas UDF is a fine solution here, because I assume that implies your
> groups aren't that big, so, maybe no need for a Spark pipeline.
>
>
> On Thu, Jan 21, 2021 at 9:20 AM Riccardo Ferrari <[email protected]>
> wrote:
>
>> Hi list,
>>
>> I am looking for an efficient solution to apply a training pipeline to
>> each group of a DataFrame.groupBy.
>>
>> This is very easy if you're using a pandas UDF (i.e. groupBy().apply()),
>> but I am not able to find the equivalent for a Spark pipeline.
>>
>> The ultimate goal is to fit multiple models, one per group of data.
>>
>> Thanks,
>>
>>
