Hi Yanbo,

Thank you, I very much appreciate your help. For the current use case, the data fits on a single node, so spark-sklearn seems to be a good choice.
I have one question regarding this:

“If not, Spark MLlib provides CrossValidation, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”

If I understand correctly, it can run the parameter search for cross-validation in parallel. However, Spark does not currently support running multiple algorithms (like Naïve Bayes, Random Forest, etc.) in parallel. Am I correct? If not, could you please point me to some resources where multiple algorithms have been run in parallel?

Thank you very much. It is a great help; I will try spark-sklearn.

Prem

From: Yanbo Liang <yblia...@gmail.com>
Date: Tuesday, September 5, 2017 at 10:40 AM
To: Patrick McCarthy <pmccar...@dstillery.com>
Cc: "Timsina, Prem" <prem.tims...@mssm.edu>, "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

Hi Prem,

How large is your dataset? Can it fit on a single node?

If not, Spark MLlib provides CrossValidation, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search.
FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

If yes, you can also try spark-sklearn, which can distribute the training of multiple models (single-node training with sklearn) across a cluster and do parameter search.
FYI: https://github.com/databricks/spark-sklearn

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmccar...@dstillery.com> wrote:

You might benefit from watching this JIRA issue - https://issues.apache.org/jira/browse/SPARK-19071

On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <prem.tims...@mssm.edu> wrote:

Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this:

A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel.
   1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.

My question is:
Can we run the multiple machine learning algorithms in step A in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of Naive Bayes training in parallel?
I was not able to find any way to run the different algorithms in parallel, and it seems cross-validation cannot be done in parallel either. I would appreciate any suggestions for parallelizing this use case.

Prem
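
For readers following the thread, the task structure behind steps A) and B) can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark or spark-sklearn code: the two toy learners, the helper names (`k_fold_indices`, `run_task`, `parallel_cross_validate`), and the tiny dataset are all hypothetical stand-ins. The point it shows is that every (algorithm, fold) pair is an independent task, so the pairs can be submitted to a worker pool together, and the out-of-fold predictions can then be assembled into features for a second-layer model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def k_fold_indices(n, k):
    """Split range(n) into k contiguous (train, test) index pairs."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


# Toy learners standing in for Naive Bayes, Random Forest, etc. (hypothetical).
def fit_majority(X, y):
    """Always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbor(X, y):
    """Predict the label of the closest training point (1-D inputs)."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


ALGORITHMS = {"majority": fit_majority, "nearest_neighbor": fit_nearest_neighbor}


def run_task(name, train, test, X, y):
    """Train one algorithm on one fold and predict that fold's held-out rows."""
    model = ALGORITHMS[name]([X[i] for i in train], [y[i] for i in train])
    return name, {i: model(X[i]) for i in test}


def parallel_cross_validate(X, y, k=10, max_workers=4):
    """Run every (algorithm, fold) pair as an independent task in a pool."""
    folds = k_fold_indices(len(X), k)
    out_of_fold = {name: [None] * len(X) for name in ALGORITHMS}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_task, name, train, test, X, y)
            for name, (train, test) in product(ALGORITHMS, folds)
        ]
        for future in futures:
            name, preds = future.result()
            for i, p in preds.items():
                out_of_fold[name][i] = p
    accuracy = {
        name: sum(p == t for p, t in zip(preds, y)) / len(y)
        for name, preds in out_of_fold.items()
    }
    return out_of_fold, accuracy


# Step A: 10-fold cross-validation of every algorithm, all tasks submitted at once.
X = [float(i) for i in range(20)]
y = [0] * 10 + [1] * 10
out_of_fold, accuracy = parallel_cross_validate(X, y, k=10)

# Step B: one row per example, one column per first-layer algorithm.
stacked_features = list(zip(*(out_of_fold[name] for name in ALGORITHMS)))
```

A thread pool is used here only to keep the sketch self-contained; in CPython it will not speed up CPU-bound training. On a cluster, the same list of independent tasks is what a tool such as spark-sklearn, as described in Yanbo's reply, could distribute across workers when each training run fits on a single node.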