Re: mllib performance on cluster
I spoke with SK offline about this. It turns out the difference in timings came from the fact that he was training 100 models for 100 iterations each and taking the total time (vs. my example, which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's worth documenting:

Benchmarking on a dataset this small, on this many cores, is probably not going to give you any meaningful information about how the algorithms scale to "real" data problems. In this case, you've thrown 200 cores at 5.6 KB of data - 200 low-dimensional data points. The overheads of scheduling tasks, sending them out to each worker, and network latencies between the nodes, which are essentially fixed regardless of problem size, are COMPLETELY dominating the time spent computing - which in the first two cases is 9-10 flops per data point, and in the last case is a couple of array lookups and adds per data point.

It would make a lot more sense to find or generate a dataset that's 10 or 100 GB and see how performance scales there. You can do this with the code I pasted earlier: just change the second, third, and fourth arguments to an appropriate number of examples, dimensionality, and number of partitions that matches the number of cores you have on your cluster.

In short, don't use a cluster unless you need one :). Hope this helps!

On Tue, Sep 2, 2014 at 3:51 PM, SK wrote:
> The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1
> column of labels. From this dataset, I split 80% for the training set and
> 20% for the test set. The features are integer counts and the labels are
> binary (1/0).
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
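To put rough numbers on "overheads completely dominate", here is a back-of-envelope model in plain Scala (no Spark needed). Every constant is an illustrative assumption, not a measurement: ~1 GFLOP/s of effective single-core throughput, ~5 ms of fixed scheduling/serialization/network cost per task, and one task per core:

```scala
// Back-of-envelope model: compute time vs. fixed per-task overhead for the
// tiny-dataset scenario discussed above. All constants are assumptions.
object OverheadModel {
  val points          = 200       // data points in the dataset
  val flopsPerPoint   = 10.0      // ~9-10 flops per point for LR/SVM gradients
  val machineFlops    = 1e9       // assumed 1 GFLOP/s effective throughput
  val tasks           = 200       // one task per core, as in the 200-core run
  val perTaskOverhead = 0.005     // assumed ~5 ms scheduling + network per task

  // Time actually spent computing: ~2 microseconds.
  val computeSeconds  = points * flopsPerPoint / machineFlops
  // Fixed overhead paid regardless of data size: ~1 second.
  val overheadSeconds = tasks * perTaskOverhead
  // Overhead outweighs computation by several orders of magnitude.
  val ratio           = overheadSeconds / computeSeconds
}
```

Under these (made-up but plausible) constants, the fixed overhead exceeds the useful computation by a factor of roughly 10^5, which is why shrinking the cluster, or skipping it entirely, makes the job faster.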
Re: mllib performance on cluster
Hmm... something is fishy here. That's a *really* small dataset for a Spark job, so almost all of your time will be spent in these overheads - but you should still be able to train a logistic regression model with the default options and 100 iterations in under a second on a single machine. Are you caching your dataset before training the classifier on it? It's possible that you're rereading it from disk (or across the network, maybe) on every iteration.

From spark-shell:

    import org.apache.spark.mllib.util.LogisticRegressionDataGenerator
    val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3, 1e-4, 4, 0.2).cache()
    println(dat.count()) // should print 200

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    val start = System.currentTimeMillis
    val model = LogisticRegressionWithSGD.train(dat, 100)
    val delta = System.currentTimeMillis - start
    println(delta) // on my laptop: 863 ms

On Tue, Sep 2, 2014 at 3:51 PM, SK wrote:
> The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1
> column of labels. From this dataset, I split 80% for the training set and
> 20% for the test set. The features are integer counts and the labels are
> binary (1/0).
>
> thanks
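If you end up timing several configurations, a tiny helper keeps the `currentTimeMillis` arithmetic out of the way. This is a sketch of my own (pure Scala, nothing like it ships with Spark), and it works for any block of code, Spark or not:

```scala
// Minimal timing helper: runs the given block once and returns its result
// together with the elapsed wall-clock time in milliseconds.
def timeMillis[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body                                  // force the by-name block exactly once
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}
```

Usage against the snippet above would look like `val (model, ms) = timeMillis { LogisticRegressionWithSGD.train(dat, 100) }` - and remember to call `dat.count()` first so the timing doesn't include reading/caching the data.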
Re: mllib performance on cluster
The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1 column of labels. From this dataset, I split 80% for the training set and 20% for the test set. The features are integer counts and the labels are binary (1/0).

thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13311.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: mllib performance on cluster
Those are interesting numbers. You haven't mentioned the dataset size in your thread. This is a classic scalability-and-performance question, assuming your baseline numbers are correct and you have tuned everything on your cluster correctly. Putting on my outsider's cap, there are multiple possible reasons for this, and we need to look at all of these:

1. There could be an algorithmic cost when we move to a cluster.
2. There could be a scalability cost.
3. The cluster may not be tuned well.
4. There may indeed be a problem/performance regression in the framework.

On Tue, Sep 2, 2014 at 1:12 PM, SK wrote:
> Num iterations: for LR and SVM, I am using the default value of 100. For
> all the other parameters I am also using the default values. I am pretty
> much reusing the code from BinaryClassification.scala. For Decision Tree,
> I don't see any parameter for the number of iterations in the example
> code, so I did not specify any. I am running each algorithm on my dataset
> 100 times and taking the average runtime.
>
> My dataset is very dense (hardly any zeros). The labels are 1 and 0.
>
> I did not explicitly specify the number of partitions. I did not see any
> code for this in the MLlib examples for BinaryClassification and
> DecisionTree.
>
> Hardware:
> Local: Intel Core i7 with 12 cores and 7.8 GB of RAM, of which I am
> allocating 4 GB for the executor memory. According to the application
> detail stats in the Spark UI, the total memory consumed is around 1.5 GB.
>
> Cluster: 10 nodes with a total of 320 cores, with 16 GB per node.
> According to the application detail stats in the Spark UI, the total
> memory consumed is around 95.5 GB.
Re: mllib performance on cluster
Num iterations: for LR and SVM, I am using the default value of 100. For all the other parameters I am also using the default values. I am pretty much reusing the code from BinaryClassification.scala. For Decision Tree, I don't see any parameter for the number of iterations in the example code, so I did not specify any. I am running each algorithm on my dataset 100 times and taking the average runtime.

My dataset is very dense (hardly any zeros). The labels are 1 and 0.

I did not explicitly specify the number of partitions. I did not see any code for this in the MLlib examples for BinaryClassification and DecisionTree.

Hardware:

Local: Intel Core i7 with 12 cores and 7.8 GB of RAM, of which I am allocating 4 GB for the executor memory. According to the application detail stats in the Spark UI, the total memory consumed is around 1.5 GB.

Cluster: 10 nodes with a total of 320 cores, with 16 GB per node. According to the application detail stats in the Spark UI, the total memory consumed is around 95.5 GB.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290p13299.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: mllib performance on cluster
Also - what hardware are you running the cluster on? And what is the local machine's hardware?

On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset (how many data points, how many features)?
> Is this sparse or dense - and in the sparse case, how many non-zeroes?
> How many partitions is your data RDD?
>
> For very small datasets, the scheduling overheads of shipping tasks
> across the cluster and delays due to stragglers can dominate the time
> actually spent doing your parallel computation. If you have too few
> partitions, you won't be taking advantage of cluster parallelism, and if
> you have too many, you're introducing even more of the aforementioned
> overheads.
>
> On Tue, Sep 2, 2014 at 11:24 AM, SK wrote:
>> Hi,
>>
>> I evaluated the runtime performance of some of the MLlib classification
>> algorithms on a local machine and a cluster with 10 nodes. I used
>> standalone mode and Spark 1.0.1 in both cases. Here are the results for
>> the total runtime:
>>
>>                          Local      Cluster
>> Logistic regression      138 sec    336 sec
>> SVM                      138 sec    336 sec
>> Decision tree            50 sec     132 sec
>>
>> My dataset is quite small, and my programs are very similar to the MLlib
>> examples that are included in the Spark distribution. Why is the runtime
>> on the cluster significantly higher (almost 3 times) than that on the
>> local machine, even though the former uses more memory and more nodes?
>> Is it because of the communication overhead on the cluster? I would like
>> to know if there is something I need to be doing to optimize the
>> performance on the cluster, or if others have also been getting similar
>> results.
>>
>> thanks
Re: mllib performance on cluster
How many iterations are you running? Can you provide the exact details about the size of the dataset (how many data points, how many features)? Is this sparse or dense - and in the sparse case, how many non-zeroes? How many partitions is your data RDD?

For very small datasets, the scheduling overheads of shipping tasks across the cluster and delays due to stragglers can dominate the time actually spent doing your parallel computation. If you have too few partitions, you won't be taking advantage of cluster parallelism, and if you have too many, you're introducing even more of the aforementioned overheads.

On Tue, Sep 2, 2014 at 11:24 AM, SK wrote:
> Hi,
>
> I evaluated the runtime performance of some of the MLlib classification
> algorithms on a local machine and a cluster with 10 nodes. I used
> standalone mode and Spark 1.0.1 in both cases. Here are the results for
> the total runtime:
>
>                          Local      Cluster
> Logistic regression      138 sec    336 sec
> SVM                      138 sec    336 sec
> Decision tree            50 sec     132 sec
>
> My dataset is quite small, and my programs are very similar to the MLlib
> examples that are included in the Spark distribution. Why is the runtime
> on the cluster significantly higher (almost 3 times) than that on the
> local machine, even though the former uses more memory and more nodes?
> Is it because of the communication overhead on the cluster? I would like
> to know if there is something I need to be doing to optimize the
> performance on the cluster, or if others have also been getting similar
> results.
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-performance-on-cluster-tp13290.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
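The partition tradeoff described above (too few partitions wastes parallelism, too many amplifies the fixed per-task overhead) can be sketched as a quick heuristic. This is a personal rule of thumb under assumed constants, not anything Spark prescribes: target roughly two tasks per core, but never spread the data so thin that each partition holds almost nothing.

```scala
// Heuristic partition count: bounded above by available parallelism and by
// the amount of data per partition. The 2x-cores multiplier and the
// 1000-records-per-partition floor are assumed constants for illustration.
def suggestedPartitions(totalCores: Int, records: Long,
                        minRecordsPerPartition: Long = 1000L): Int = {
  val byCores = totalCores * 2                                  // enough tasks to keep cores busy
  val byData  = math.max(1L, records / minRecordsPerPartition)  // enough data per task to matter
  math.max(1, math.min(byCores.toLong, byData).toInt)
}
```

For the 200-row dataset in this thread and a 320-core cluster, the heuristic collapses to a single partition - in other words, no cluster at all, which matches the advice elsewhere in the thread. For a dataset of 100 million records it would suggest using the full 640-way parallelism.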