Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this; it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations each and
taking the total time (vs. my example, which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
worth documenting.
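(To make the comparison concrete, here is a rough sketch of the two timing
loops, reusing the toy generator from the spark-shell example further down the
thread. This is illustrative only, not SK's actual benchmark code.)

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

// the same toy dataset as in the spark-shell example below
val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3, 1e-4, 4, 0.2).cache()
dat.count()

// what was effectively timed in SK's benchmark: 100 models x 100 iterations, summed
val t0 = System.currentTimeMillis
for (i <- 1 to 100) { LogisticRegressionWithSGD.train(dat, 100) }
val hundredModels = System.currentTimeMillis - t0

// what my example timed: a single model trained for 100 iterations
val t1 = System.currentTimeMillis
val model = LogisticRegressionWithSGD.train(dat, 100)
val oneModel = System.currentTimeMillis - t1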

Benchmarking on a dataset this small on this many cores is probably not
going to give you any meaningful information about how the algorithms scale
to "real" data problems.

In this case, you've thrown 200 cores at 5.6 KB of data: 200
low-dimensional data points. The overheads of scheduling tasks, shipping
them out to each worker, and the network latencies between the nodes, which
are essentially fixed regardless of problem size, are COMPLETELY dominating
the time spent computing, which in the first two cases is 9-10 flops per data
point and in the last case is a couple of array lookups and adds per data
point.
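To put rough numbers on that (these are my own ballpark assumptions, not
measurements from SK's setup):

// Back-of-envelope, in spark-shell or plain Scala:
val points = 200
val flopsPerPoint = 10                          // roughly what the LR/SVM gradient costs per point
val flopsPerIteration = points * flopsPerPoint  // ~2,000 flops of useful work per iteration
val computeSeconds = flopsPerIteration / 1e9    // ~2 microseconds on a ~1 Gflop/s core
// versus milliseconds (local) to tens of milliseconds (cluster) of essentially
// fixed scheduling and network overhead per iteration, i.e. orders of magnitude
// more time in overhead than in the actual math.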

It would make a lot more sense to find or generate a dataset that's 10 or
100 GB and see how performance scales there. You can do this with the code I
pasted earlier; just change the second, third, and fourth arguments to an
appropriate number of elements, dimensionality, and a number of partitions
that matches the number of cores you have on your cluster.
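For example, something along these lines, reusing the generator from my
spark-shell example (the sizes here are placeholders; scale them so the cached
RDD actually fits in your cluster's memory):

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator

// e.g. 10M examples x 100 features in 200 partitions, to match the cores in use
val big = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 10000000, 100, 1e-4, 200, 0.2).cache()
big.count()  // materialize and cache the data before timing any training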

In short, don't use a cluster unless you need one :).

Hope this helps!


On Tue, Sep 2, 2014 at 3:51 PM, SK  wrote:

> The dataset is quite small: 5.6 KB. It has 200 rows, 3 features, and 1
> column of labels. From this dataset, I split 80% for the training set and 20%
> for the test set. The features are integer counts and the labels are binary (1/0).
>
> thanks


Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here.

That's a *really* small dataset for a Spark job, so almost all of your time
will be spent in scheduling and communication overheads, but you should still
be able to train a logistic regression model with the default options and 100
iterations in under a second on a single machine.

Are you caching your dataset before training the classifier on it? It's
possible that you're re-reading it from disk (or across the internet, maybe)
on every iteration.

From spark-shell:

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator

val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3, 1e-4, 4, 0.2).cache()

println(dat.count()) // should give 200

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val start = System.currentTimeMillis; val model = LogisticRegressionWithSGD.train(dat, 100); val delta = System.currentTimeMillis - start

println(delta) // On my laptop, 863 ms.

On Tue, Sep 2, 2014 at 3:51 PM, SK  wrote:

> The dataset is quite small: 5.6 KB. It has 200 rows, 3 features, and 1
> column of labels. From this dataset, I split 80% for the training set and 20%
> for the test set. The features are integer counts and the labels are binary (1/0).
>
> thanks


Re: mllib performance on cluster

2014-09-02 Thread SK
The dataset is quite small: 5.6 KB. It has 200 rows, 3 features, and 1
column of labels. From this dataset, I split 80% for the training set and 20%
for the test set. The features are integer counts and the labels are binary (1/0).

thanks


Re: mllib performance on cluster

2014-09-02 Thread Bharath Mundlapudi
Those are interesting numbers. You haven't mentioned the dataset size in
your thread. This is a classic scalability-versus-performance question,
assuming your baseline numbers are correct and everything on your cluster is
tuned correctly.

Putting on my outsider's cap, there are multiple possible reasons for this; we
need to look at all of these:
1. This could be an algorithmic cost of moving to a cluster.
2. This could be a scalability cost.
3. The cluster may not be tuned well.
4. There could indeed be a problem or performance regression in the framework.

On Tue, Sep 2, 2014 at 1:12 PM, SK  wrote:

> Num iterations: for LR and SVM, I am using the default value of 100. For all
> the other parameters I am also using the default values. I am pretty much
> reusing the code from BinaryClassification.scala. For Decision Tree, I don't
> see any parameter for the number of iterations in the example code, so I did
> not specify any. I am running each algorithm on my dataset 100 times and
> taking the average runtime.
>
> My dataset is very dense (hardly any zeros). The labels are 1 and 0.
>
> I did not explicitly specify the number of partitions. I did not see any code
> for this in the MLlib examples for BinaryClassification and DecisionTree.
>
> Hardware:
> local: Intel Core i7 with 12 cores and 7.8 GB of memory, of which I am
> allocating 4 GB for executor memory. According to the application detail
> stats in the Spark UI, the total memory consumed is around 1.5 GB.
>
> cluster: 10 nodes with a total of 320 cores and 16 GB per node. According to
> the application detail stats in the Spark UI, the total memory consumed is
> around 95.5 GB.


Re: mllib performance on cluster

2014-09-02 Thread SK
Num iterations: for LR and SVM, I am using the default value of 100. For all
the other parameters I am also using the default values. I am pretty much
reusing the code from BinaryClassification.scala. For Decision Tree, I don't
see any parameter for the number of iterations in the example code, so I did
not specify any. I am running each algorithm on my dataset 100 times and
taking the average runtime.

My dataset is very dense (hardly any zeros). The labels are 1 and 0.

I did not explicitly specify the number of partitions. I did not see any code
for this in the MLlib examples for BinaryClassification and DecisionTree.

Hardware:
local: Intel Core i7 with 12 cores and 7.8 GB of memory, of which I am
allocating 4 GB for executor memory. According to the application detail
stats in the Spark UI, the total memory consumed is around 1.5 GB.

cluster: 10 nodes with a total of 320 cores and 16 GB per node. According to
the application detail stats in the Spark UI, the total memory consumed is
around 95.5 GB.






Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Also - what hardware are you running the cluster on? And what is the local
machine hardware?


On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks wrote:

> How many iterations are you running? Can you provide the exact details
> about the size of the dataset (how many data points, how many features)? Is
> this sparse or dense, and for the sparse case, how many non-zeroes? How
> many partitions does your data RDD have?
>
> For very small datasets, the scheduling overheads of shipping tasks across
> the cluster and delays due to stragglers can dominate the time actually spent
> doing your parallel computation. If you have too few partitions, you won't
> be taking advantage of cluster parallelism, and if you have too many, you're
> introducing even more of the aforementioned overheads.
>
>
>
> On Tue, Sep 2, 2014 at 11:24 AM, SK  wrote:
>
>> Hi,
>>
>> I evaluated the runtime performance of some of the MLlib classification
>> algorithms on a local machine and on a cluster with 10 nodes. I used
>> standalone mode and Spark 1.0.1 in both cases. Here are the results for the
>> total runtime:
>>
>>                       Local     Cluster
>> Logistic regression   138 sec   336 sec
>> SVM                   138 sec   336 sec
>> Decision tree          50 sec   132 sec
>>
>> My dataset is quite small and my programs are very similar to the MLlib
>> examples that are included in the Spark distribution. Why is the runtime on
>> the cluster significantly higher (almost 3 times) than on the local machine,
>> even though the former uses more memory and more nodes? Is it because of the
>> communication overhead on the cluster? I would like to know if there is
>> something I need to be doing to optimize the performance on the cluster, or
>> if others have also been getting similar results.
>>
>> thanks


Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details
about the size of the dataset (how many data points, how many features)? Is
this sparse or dense, and for the sparse case, how many non-zeroes? How
many partitions does your data RDD have?

For very small datasets, the scheduling overheads of shipping tasks across
the cluster and delays due to stragglers can dominate the time actually spent
doing your parallel computation. If you have too few partitions, you won't
be taking advantage of cluster parallelism, and if you have too many, you're
introducing even more of the aforementioned overheads.
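(For reference, and not from the original exchange: two common ways to control
the partition count in code. The path and the 320 below are placeholders for
your own input and core count.)

// Give the loader a minimum-partition hint when reading the data:
val data = sc.textFile("hdfs:///path/to/training-data", 320)

// Or reshuffle an RDD you already have into roughly one partition per core:
val repartitioned = data.repartition(320)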



On Tue, Sep 2, 2014 at 11:24 AM, SK  wrote:

> Hi,
>
> I evaluated the runtime performance of some of the MLlib classification
> algorithms on a local machine and on a cluster with 10 nodes. I used
> standalone mode and Spark 1.0.1 in both cases. Here are the results for the
> total runtime:
>
>                       Local     Cluster
> Logistic regression   138 sec   336 sec
> SVM                   138 sec   336 sec
> Decision tree          50 sec   132 sec
>
> My dataset is quite small and my programs are very similar to the MLlib
> examples that are included in the Spark distribution. Why is the runtime on
> the cluster significantly higher (almost 3 times) than on the local machine,
> even though the former uses more memory and more nodes? Is it because of the
> communication overhead on the cluster? I would like to know if there is
> something I need to be doing to optimize the performance on the cluster, or
> if others have also been getting similar results.
>
> thanks