Hi Yanbo,
Thank you, I very much appreciate your help.
For the current use case, the data can fit on a single node, so 
spark-sklearn seems to be a good choice.

I have one question regarding this:
“If no, Spark MLlib provides CrossValidator, which can run multiple machine 
learning algorithms in parallel on a distributed dataset and do parameter search. 
FYI: 
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”
If I understand correctly, it can run the parameter search for cross-validation in 
parallel.
However, Spark does not currently support running multiple algorithms (like 
Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?
If not, could you please point me to some resources where multiple algorithms 
have been run in parallel?
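
One way to picture the difference between a parameter search and a multi-algorithm search is that both reduce to a flat list of independent (algorithm, parameters) tasks. The sketch below is plain Python, not Spark or spark-sklearn code, and nothing in it comes from either library: the helper names (expand_grid, build_tasks, run_search), the algorithm labels, the parameter names, and the evaluate callback are all hypothetical and only illustrate the idea.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Expand {'a': [1, 2], 'b': [3]} into one dict per parameter combination."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]


def build_tasks(search_space):
    """Flatten {algorithm: param_grid} into a list of (algorithm, params) tasks."""
    return [(algo, params)
            for algo, grid in search_space.items()
            for params in expand_grid(grid)]


def run_search(search_space, evaluate, workers=4):
    """Score every task in parallel and return the best task with its score.

    `evaluate(algo, params)` is a caller-supplied (hypothetical) function,
    for example one that returns a cross-validated accuracy.
    """
    tasks = build_tasks(search_space)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda task: evaluate(*task), tasks))
    best = max(range(len(tasks)), key=scores.__getitem__)
    return tasks[best], scores[best]
```

Seen this way, the algorithm is just one more dimension of the search, so whether several algorithms can run side by side depends on whether the tool accepts tasks of this mixed shape, not on the parallel machinery itself.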

Thank you very much. It is a great help; I will try spark-sklearn.
Prem




From: Yanbo Liang <yblia...@gmail.com>
Date: Tuesday, September 5, 2017 at 10:40 AM
To: Patrick McCarthy <pmccar...@dstillery.com>
Cc: "Timsina, Prem" <prem.tims...@mssm.edu>, "user@spark.apache.org" 
<user@spark.apache.org>
Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning 
ALgorithm

Hi Prem,

How large is your dataset? Can it fit on a single node?
If no, Spark MLlib provides CrossValidator, which can run multiple machine 
learning algorithms in parallel on a distributed dataset and do parameter search. 
FYI: 
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If yes, you can also try spark-sklearn, which can distribute the training of 
multiple models (single-node training with sklearn) across a cluster and do 
parameter search. FYI: 
https://github.com/databricks/spark-sklearn

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy 
<pmccar...@dstillery.com> wrote:
You might benefit from watching this JIRA issue: 
https://issues.apache.org/jira/browse/SPARK-19071

On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem 
<prem.tims...@mssm.edu> wrote:
Is there a way to parallelize multiple ML algorithms in Spark? My use case is 
something like this:
A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, 
etc.) in parallel.
1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.
My questions are:
Can we run the multiple machine learning algorithms in step A in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of 
Naive Bayes training in parallel?

I was not able to find any way to run the different algorithms in parallel, and 
it seems cross-validation also cannot be done in parallel.
I would appreciate any suggestions for parallelizing this use case.
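
The use case above can be sketched in plain Python. This is a minimal illustration only, not Spark code and not something posted in this thread: it treats every (algorithm, fold) pair as an independent task, runs the tasks on a thread pool, and collects out-of-fold predictions that could serve as input to the second-layer model in step B. The two toy "algorithms", the helper names (k_fold_indices, run_fold, cross_validate_all, stack_features), and the worker count are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def k_fold_indices(n_rows, k):
    """Return k (train_indices, test_indices) pairs covering rows 0..n_rows-1."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    return [([idx for j, fold in enumerate(folds) if j != i for idx in fold], test)
            for i, test in enumerate(folds)]


def fit_majority(xs, ys):
    """Toy algorithm 1: always predict the most common training label."""
    label = max(set(ys), key=ys.count)
    return lambda x: label


def fit_threshold(xs, ys):
    """Toy algorithm 2: cut halfway between the class means.

    Assumes binary labels 0/1 and that class 1 has the larger feature values.
    """
    mean0 = mean([x for x, y in zip(xs, ys) if y == 0])
    mean1 = mean([x for x, y in zip(xs, ys) if y == 1])
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)


def run_fold(task):
    """Fit one algorithm on one training split; predict its held-out rows."""
    name, fit, train, test, xs, ys = task
    model = fit([xs[i] for i in train], [ys[i] for i in train])
    return name, {i: model(xs[i]) for i in test}


def cross_validate_all(algorithms, xs, ys, k=10, workers=4):
    """Step A: run every (algorithm, fold) pair as an independent task."""
    tasks = [(name, fit, train, test, xs, ys)
             for name, fit in algorithms.items()
             for train, test in k_fold_indices(len(xs), k)]
    out_of_fold = {name: {} for name in algorithms}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, preds in pool.map(run_fold, tasks):
            out_of_fold[name].update(preds)
    accuracy = {name: mean([int(preds[i] == ys[i]) for i in range(len(ys))])
                for name, preds in out_of_fold.items()}
    return out_of_fold, accuracy


def stack_features(out_of_fold, n_rows):
    """Step B input: one row per example, one column per first-layer algorithm."""
    names = sorted(out_of_fold)
    return [[out_of_fold[name][i] for name in names] for i in range(n_rows)]
```

Threads in CPython will not speed up CPU-bound pure-Python fits, so the point here is the task structure rather than the speedup. The same list of independent tasks is what a process pool or a cluster scheduler would distribute.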

Prem

