RE: build models in parallel

2016-12-01 Thread Masood Krohy
You can use your groupId as a grid parameter and filter your dataset on 
this id in a pipeline stage before feeding it to the model.
The following may help:
http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder

The above should work, but I haven't tried it myself. What I have tried is 
the following embarrassingly parallel architecture (TensorFlow was a 
requirement in the use case):

See a PySpark/TensorFlow example here:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

A relevant excerpt from the notebook mentioned above:
http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html
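num_nodes = 4
# Chunk size: at least 2 experiments per chunk
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
# One partition per chunk, so the chunks can run in parallel across the cluster
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
# Within a partition, run the chunk's experiments one after another, then gather all results
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()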
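
To make the chunking in the excerpt easier to try out, here is a minimal sketch that also runs without a cluster. The helper names chunk_experiments and run_all, the toy run function, and the local fallback used when no SparkContext is passed are assumptions for illustration; they are not part of the notebook.

```python
def chunk_experiments(all_experiments, num_nodes):
    """Split the experiment list into chunks of size n (the last one may be shorter)."""
    n = max(2, len(all_experiments) // num_nodes)
    return [all_experiments[i:i + n] for i in range(0, len(all_experiments), n)]


def run_all(all_experiments, run, num_nodes, sc=None):
    """Run every experiment; use Spark when a SparkContext is supplied."""
    grouped = chunk_experiments(all_experiments, num_nodes)
    if sc is None:
        # Local fallback (assumption, for testing only): same results, no parallelism.
        return [run(*y) for z in grouped for y in z]
    # One partition per chunk, as in the excerpt above.
    rdd = sc.parallelize(grouped, numSlices=len(grouped))
    return rdd.flatMap(lambda z: [run(*y) for y in z]).collect()


def run(group_id, learning_rate):
    """Hypothetical stand-in for real model training on one (groupId, parameter) pair."""
    return (group_id, learning_rate)


# groupId is simply one more dimension of the experiment grid.
experiments = [(g, lr) for g in ["g1", "g2", "g3"] for lr in [0.1, 0.01]]
results = run_all(experiments, run, num_nodes=4)
```

Passing an existing SparkContext as sc should switch the same call to the distributed path.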

Again, as above, you use your groupId as a parameter in the grid search. 
This works if your full dataset fits in the memory of a single machine. To 
maximize the size of the dataset this approach can handle, broadcast the 
dataset in a compressed format and do the preprocessing and feature 
engineering only after filtering on groupId.
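
A rough sketch of that idea, under stated assumptions: rows are (groupId, x, y) tuples, pickle plus zlib stands in for "a compressed format", and a closed-form least-squares line stands in for the real model. The names compress_rows, fit_line, and train_for_group are hypothetical. With Spark, the compressed blob would be shared via sc.broadcast(blob) and read on the workers through the broadcast's .value; the code below calls the worker function directly so it can run anywhere.

```python
import pickle
import zlib


def compress_rows(rows):
    """Serialize and compress the full dataset once, on the driver."""
    return zlib.compress(pickle.dumps(rows))


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def train_for_group(blob, group_id):
    """Worker-side function: decompress, filter on groupId first, then fit."""
    rows = pickle.loads(zlib.decompress(blob))
    # Filter before any heavier preprocessing so only one group's rows are expanded.
    points = [(x, y) for gid, x, y in rows if gid == group_id]
    return group_id, fit_line(points)


rows = [("a", 0.0, 1.0), ("a", 1.0, 3.0), ("a", 2.0, 5.0),
        ("b", 0.0, 0.0), ("b", 1.0, -1.0), ("b", 2.0, -2.0)]
blob = compress_rows(rows)

# With Spark (not executed here):
#   bc = sc.broadcast(blob)
#   models = dict(sc.parallelize(ids, len(ids))
#                   .map(lambda g: train_for_group(bc.value, g)).collect())
models = dict(train_for_group(blob, g) for g in ["a", "b"])
```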
Masood


--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Xiaomeng Wan 
To: User 
Date: 2016-11-29 11:54
Subject: build models in parallel



I want to divide big data into groups (e.g., group by some id) and build one 
model for each group. I am wondering whether I can parallelize the model 
building process by implementing a UDAF (e.g., running linear regression in 
its evaluate method). Is it good practice? Does anybody have experience with this? Thanks!

Regards,
Shawn



Re: build models in parallel

2016-11-29 Thread Georg Heiler
The people in this video use such functionality via PySpark:
https://www.youtube.com/watch?v=R-6nAwLyWCI

Xiaomeng Wan wrote on Tue, 29 Nov 2016 at 17:54:

> I want to divide big data into groups (e.g., group by some id) and build one
> model for each group. I am wondering whether I can parallelize the model
> building process by implementing a UDAF (e.g., running linear regression in its
> evaluate method). Is it good practice? Does anybody have experience with this? Thanks!
>
> Regards,
> Shawn
>