You can use your groupId as a grid parameter, filter your dataset using
this id in a pipeline stage, before feeding it to the model.
The following may help:
The above should work ,but I haven't tried it myself. What I have tried is
the following Embarrassingly Parallel architecture (as TensorFlow was a
requirement in the use case):
See a PySpark/TensorFlow example here:
A relevant excerpt from the notebook mentioned above:
num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0,
all_exps_rdd = sc.parallelize(grouped_experiments,
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()
Again, like above, you use your groupId as a parameter in the grid search;
it works if your full dataset fits in the memory of a single machine. You
can broadcast the dataset in a compressed format and do the preprocessing
and feature engineering after you've done the filtering on groupId to
maximize the size of the dataset that can use this modeling approach.
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
De : Xiaomeng Wan <shawn...@gmail.com>
A : User <email@example.com>
Date : 2016-11-29 11:54
Objet : build models in parallel
I want to divide big data into groups (eg groupby some id), and build one
model for each group. I am wondering whether I can parallelize the model
building process by implementing a UDAF (eg running linearregression in
its evaluate mothod). is it good practice? anybody has experience? Thanks!