You can use your groupId as a grid parameter and filter your dataset on
this id in a pipeline stage before feeding it to the model.
The following may help:
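A minimal, Spark-free sketch of the groupId-as-grid-parameter idea. All names here (dataset, param_grid, fit_one) are illustrative stand-ins, not from the original post; in practice the filter step would be a pipeline stage over a Spark DataFrame.

```python
# Toy dataset: (group_id, value) rows for two groups.
dataset = [(g, x) for g in ("a", "b") for x in range(4)]

# Treat group_id as just another grid-search parameter,
# crossed with the usual hyperparameters.
param_grid = [
    {"group_id": g, "alpha": a}
    for g in ("a", "b")
    for a in (0.1, 1.0)
]

def fit_one(params):
    # Filter stage: keep only this group's rows before "fitting".
    subset = [x for g, x in dataset if g == params["group_id"]]
    # Stand-in for model fitting: summarize the subset.
    return params["group_id"], params["alpha"], sum(subset)

models = [fit_one(p) for p in param_grid]
```

Each grid point now trains one per-group model, so a standard grid-search runner parallelizes the per-group training for free.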
The above should work, but I haven't tried it myself. What I have tried is
the following embarrassingly parallel architecture (TensorFlow was a
requirement in that use case):
See a PySpark/TensorFlow example here:
A relevant excerpt from the notebook mentioned above:
num_nodes = 4
# Chunk size: split the experiment list roughly evenly across nodes.
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i + n]
                       for i in range(0, len(all_experiments), n)]
# One partition per chunk, so each group of experiments lands on one executor.
all_exps_rdd = sc.parallelize(grouped_experiments, len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()
Again, as above, you use your groupId as a parameter in the grid search;
this works if your full dataset fits in the memory of a single machine. You
can broadcast the dataset in a compressed format and do the preprocessing
and feature engineering after filtering on groupId, to maximize the size
of the dataset that can use this modeling approach.
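A minimal, standard-library-only sketch of the compress-then-filter idea. The dataset, group_id field, and run function are all illustrative; in Spark, blob would be wrapped in sc.broadcast() and run would be called inside the parallelized tasks.

```python
import pickle
import zlib

# Hypothetical dataset: each row is tagged with a group_id.
dataset = [
    {"group_id": g, "x": i, "y": 2 * i + g}
    for g in range(3)
    for i in range(5)
]

# Compress the full dataset once; this is the blob you would
# broadcast to every executor.
blob = zlib.compress(pickle.dumps(dataset))

def run(blob, group_id):
    # Decompress, then filter on group_id FIRST, so preprocessing
    # and feature engineering only touch the small per-group slice.
    rows = pickle.loads(zlib.decompress(blob))
    subset = [r for r in rows if r["group_id"] == group_id]
    # Stand-in for per-group model fitting: return the group size.
    return group_id, len(subset)

results = [run(blob, g) for g in range(3)]
```

Because only the compressed blob is shipped and each task expands just its own group, the broadcastable dataset can be considerably larger than what a single uncompressed copy per task would allow.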
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation
From: Xiaomeng Wan <shawn...@gmail.com>
To: User <firstname.lastname@example.org>
Date: 2016-11-29 11:54
Subject: build models in parallel
I want to divide big data into groups (e.g., group by some id) and build
one model for each group. I am wondering whether I can parallelize the
model-building process by implementing a UDAF (e.g., running linear
regression in its evaluate method). Is this good practice? Does anybody
have experience with this? Thanks!