You can use your groupId as a grid parameter, filter your dataset using 
this id in a pipeline stage, before feeding it to the model.
The following may help:

The above should work ,but I haven't tried it myself. What I have tried is 
the following Embarrassingly Parallel architecture (as TensorFlow was a 
requirement in the use case):

See a PySpark/TensorFlow example here:

A relevant excerpt from the notebook mentioned above:
num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, 
len(all_experiments), n)]
all_exps_rdd = sc.parallelize(grouped_experiments, 
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()

Again, like above, you use your groupId as a parameter in the grid search; 
it works if your full dataset fits in the memory of a single machine. You 
can broadcast the dataset in a compressed format and do the preprocessing 
and feature engineering after you've done the filtering on groupId to 
maximize the size of the dataset that can use this modeling approach.

Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R&D 
Intact Financial Corporation 

De :    Xiaomeng Wan <>
A :     User <>
Date :  2016-11-29 11:54
Objet : build models in parallel

I want to divide big data into groups (eg groupby some id), and build one 
model for each group. I am wondering whether I can parallelize the model 
building process by implementing a UDAF (eg running linearregression in 
its evaluate mothod). is it good practice? anybody has experience? Thanks!


Reply via email to