RE: build models in parallel

Masood Krohy Thu, 01 Dec 2016 07:24:47 -0800

You can use your groupId as a grid parameter and filter your dataset on 
this id in a pipeline stage before feeding it to the model.
The following may help:
http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder


The above should work, but I haven't tried it myself. What I have tried is 
the following embarrassingly parallel architecture (as TensorFlow was a 
requirement in the use case):

See a PySpark/TensorFlow example here:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

A relevant excerpt from the notebook mentioned above:
http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html
num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()
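
To see what the chunking in that excerpt does without a cluster, here is a minimal local sketch. The experiment list and the run function are hypothetical stand-ins (the notebook defines its own), and a plain list comprehension takes the place of the RDD flatMap:

```python
# Hypothetical stand-in for the notebook's experiment list:
# one (group_id, reg_param) tuple per model to train.
all_experiments = [(g, r) for g in ("a", "b", "c") for r in (0.01, 0.1, 1.0)]

def chunk(experiments, num_nodes):
    """Split experiments into consecutive chunks of size max(2, len // num_nodes)."""
    n = max(2, len(experiments) // num_nodes)
    return [experiments[i:i + n] for i in range(0, len(experiments), n)]

def run(group_id, reg_param):
    """Hypothetical placeholder for training one model; returns a label only."""
    return f"{group_id}:{reg_param}"

grouped_experiments = chunk(all_experiments, num_nodes=4)

# Local equivalent of rdd.flatMap(lambda z: [run(*y) for y in z]).collect():
# each chunk is processed as a unit and the per-chunk results are flattened.
results = [run(*y) for z in grouped_experiments for y in z]

print(len(grouped_experiments), len(results))
```

With 9 experiments and num_nodes = 4 the chunk size is 2, so the last chunk is shorter than the others; the number of chunks (and hence Spark slices) can exceed num_nodes.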

Again, as above, you use your groupId as a parameter in the grid search. 
This works if your full dataset fits in the memory of a single machine. To 
maximize the size of the dataset that can use this modeling approach, you 
can broadcast the dataset in a compressed format and do the preprocessing 
and feature engineering only after you've filtered on groupId.
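
A rough sketch of that idea, using only the standard library so it runs anywhere. The rows, the fit_group helper, and the closed-form one-variable regression are all assumptions for illustration; in Spark the compressed bytes would be the value handed to sc.broadcast, and fit_group would be what each task calls:

```python
import pickle
import zlib

# Hypothetical raw dataset: (group_id, x, y) rows, small enough for one machine.
rows = [("a", 1.0, 3.0), ("a", 2.0, 5.0), ("a", 3.0, 7.0),
        ("b", 1.0, 1.0), ("b", 2.0, 0.0), ("b", 3.0, -1.0)]

# Compress once on the driver; this bytes object is what would be broadcast.
payload = zlib.compress(pickle.dumps(rows))

def fit_group(payload, group_id):
    """Decompress, filter to one group first, then fit y = intercept + slope * x."""
    data = pickle.loads(zlib.decompress(payload))
    # Filter before any further preprocessing so that work only touches
    # the rows belonging to this group.
    xs, ys = zip(*[(x, y) for g, x, y in data if g == group_id])
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return group_id, slope, mean_y - slope * mean_x

# One model per group; on a cluster each call would run in its own task.
models = [fit_group(payload, g) for g in ("a", "b")]
print(models)
```

Each task pays the decompression cost once, so this trades some CPU for a smaller broadcast and a lower memory footprint until the group has been selected.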
Masood


------------------------------
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R&D 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From:    Xiaomeng Wan <shawn...@gmail.com>
To:      User <user@spark.apache.org>
Date:    2016-11-29 11:54
Subject: build models in parallel



I want to divide big data into groups (e.g. group by some id) and build one 
model for each group. I am wondering whether I can parallelize the model 
building process by implementing a UDAF (e.g. running linear regression in 
its evaluate method). Is it good practice? Does anybody have experience 
with this? Thanks!

Regards,
Shawn
