Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Bryan Cutler
Hi Prem,

Spark does support trying different algorithms in a single CrossValidator,
but it's not obvious.  You need to make a Pipeline and build a param grid
in which the Pipeline's stages param itself varies, with each algorithm as
an alternative set of stages.  Here is a simple example:

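import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Stages are left unset here; the param grid supplies them.
val pipeline = new Pipeline()

// Each grid entry is an alternative value for the pipeline's stages param.
val paramGrid = new ParamGridBuilder()
  .addGrid(pipeline.stages, Array(
    Array[PipelineStage](dt),
    Array[PipelineStage](lr)))
  .build()  // returns the Array[ParamMap] that setEstimatorParamMaps expects

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  // Assumption: fit() needs an evaluator; pick one that suits your labels.
  .setEvaluator(new MulticlassClassificationEvaluator())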

Adding more params to the grid can get a little complicated, though; I
discuss it in detail here: https://bryancutler.github.io/cv-pipelines/
As Patrick McCarthy mentioned, you might want to follow SPARK-19071,
specifically https://issues.apache.org/jira/browse/SPARK-19357, which
parallelizes model evaluation.
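
A minimal sketch of how per-algorithm params might be added to such a grid. It
is not taken from the linked post: the object name MultiAlgorithmGrid, the
param values, the evaluator choice, and the parallelism level are all
assumptions made for illustration. The idea is to build one grid per algorithm,
pin the pipeline's stages with baseOn, and concatenate the results. The
setParallelism call assumes Spark 2.3 or later, where, to my knowledge, the
work from SPARK-19357 became available.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object MultiAlgorithmGrid {
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")

  val lr = new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features")

  // Stages are supplied by the param maps below, not set on the pipeline.
  val pipeline = new Pipeline()

  // Grid for the decision tree only: baseOn pins the stages,
  // addGrid varies a param that belongs to that algorithm.
  val dtGrid: Array[ParamMap] = new ParamGridBuilder()
    .baseOn(pipeline.stages -> Array[PipelineStage](dt))
    .addGrid(dt.maxDepth, Array(2, 5))          // illustrative values
    .build()

  // Grid for logistic regression only.
  val lrGrid: Array[ParamMap] = new ParamGridBuilder()
    .baseOn(pipeline.stages -> Array[PipelineStage](lr))
    .addGrid(lr.regParam, Array(0.01, 0.1, 1.0)) // illustrative values
    .build()

  // Concatenating keeps each algorithm's params out of the other's maps:
  // 2 decision tree settings + 3 logistic regression settings.
  val paramGrid: Array[ParamMap] = dtGrid ++ lrGrid

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEstimatorParamMaps(paramGrid)
    .setEvaluator(new MulticlassClassificationEvaluator()) // assumed metric
    .setNumFolds(10)
    .setParallelism(4) // assumes Spark 2.3+; number of models fit concurrently
}
```

Calling MultiAlgorithmGrid.cv.fit(trainingData), where trainingData is your own
DataFrame with "label" and "features" columns, would then evaluate all five
param maps on each fold. Putting both algorithms' params into a single
ParamGridBuilder would instead produce the full cross product, including
combinations where a param belongs to a stage that is not in the pipeline.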

Bryan


Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Yanbo Liang
You are right, native Spark MLlib CrossValidator can't run *different*
algorithms in parallel.

Thanks
Yanbo



Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Timsina, Prem
Hi Yanbo,

Thank you, I very much appreciate your help.
For the current use case, the data can fit on a single node, so
spark-sklearn seems to be a good choice.

I have one question regarding this:

“If not, Spark MLlib provides CrossValidator, which can run multiple machine
learning algorithms in parallel on a distributed dataset and do parameter
search. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”

If I understand correctly, it can run the parameter search for
cross-validation in parallel. However, Spark does not currently support
running multiple algorithms (like Naïve Bayes, Random Forest, etc.) in
parallel. Am I correct? If not, could you please point me to some resources
where multiple algorithms have been run in parallel?

Thank you very much. It is a great help; I will try spark-sklearn.
Prem





Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Yanbo Liang
Hi Prem,

How large is your dataset? Can it fit on a single node?
If not, Spark MLlib provides CrossValidator, which can run multiple machine
learning algorithms in parallel on a distributed dataset and do parameter
search. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If yes, you can also try spark-sklearn, which can distribute the training of
multiple models (single-node training with sklearn) across a distributed
cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn

Thanks
Yanbo



Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Patrick McCarthy
You might benefit from watching this JIRA issue -
https://issues.apache.org/jira/browse/SPARK-19071



Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-03 Thread Timsina, Prem
Is there a way to parallelize multiple ML algorithms in Spark? My use case is
something like this:

A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest,
   etc.) in parallel.
   1) Validate each algorithm using 10-fold cross-validation.
B) Feed the output of step A) into a second-layer machine learning algorithm.

My questions are:
Can we run the multiple machine learning algorithms in step A in parallel?
Can we do cross-validation in parallel? For example, run the 10 iterations of
Naive Bayes training in parallel?

I was not able to find any way to run the different algorithms in parallel, and
it seems cross-validation also cannot be done in parallel.
I would appreciate any suggestions for parallelizing this use case.

Prem




