Re: mllib performance on cluster

2014-09-03 Thread Evan R. Sparks
I spoke with SK offline about this; it looks like the difference in timings came from the fact that he was training 100 models for 100 iterations and taking the total time (vs. my example, which trains a single model for 100 iterations). I'm posting my response here, though, because I think it's wor
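
To make the "100 models vs. one model" distinction concrete, here is a minimal plain-Python sketch. It is not the code from the thread and does not use Spark or MLlib. The helper names (`make_rows`, `train_logistic`, `timed`), the step size, and the synthetic data are all assumptions chosen for illustration. It times one logistic regression fit of 100 iterations against 100 such fits on a dataset of roughly the size discussed.

```python
import math
import random
import time


def make_rows(n_rows=160, n_features=3, seed=0):
    """Synthetic stand-in data: integer-count features, binary labels (assumed)."""
    rng = random.Random(seed)
    return [
        (rng.randint(0, 1), [float(rng.randint(0, 10)) for _ in range(n_features)])
        for _ in range(n_rows)
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, iterations=100, step=0.01):
    """Batch gradient descent; returns weights with the bias term first."""
    n_features = len(rows[0][1])
    w = [0.0] * (n_features + 1)
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for label, features in rows:
            x = [1.0] + features
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - label) * xj
        w = [wi - step * gj / len(rows) for wi, gj in zip(w, grad)]
    return w


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    train = make_rows()
    _, t_one = timed(lambda: train_logistic(train))
    _, t_many = timed(lambda: [train_logistic(train) for _ in range(100)])
    print(f"1 model: {t_one:.3f}s, 100 models: {t_many:.3f}s")
```

The absolute numbers will differ from anything measured on Spark; the point is only that the second measurement covers about 100 times the work of the first, so the two totals are not comparable.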

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Hmm... something is fishy here. That's a *really* small dataset for a Spark job, so almost all your time will be spent in these overheads, but you should still be able to train a logistic regression model with the default options and 100 iterations in <1s on a single machine. Are you caching your

Re: mllib performance on cluster

2014-09-02 Thread SK
The dataset is quite small: 5.6 KB. It has 200 rows, 3 features, and 1 column of labels. From this dataset, I split off 80% for the training set and 20% for the test set. The features are integer counts and the labels are binary (1/0). Thanks
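
For anyone who wants to reproduce the setup, a dataset of this shape can be mocked up in a few lines. This is a minimal plain-Python sketch, not SK's data or code. The function names, the count range 0 to 10, the random labels, and the seed are assumptions. It produces an exact 160/40 split, whereas a randomized per-row split would only hit 80/20 approximately.

```python
import random


def make_dataset(n_rows=200, n_features=3, seed=42):
    """Build (label, features) rows: integer-count features and a 0/1 label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        features = [rng.randint(0, 10) for _ in range(n_features)]  # assumed range
        label = rng.randint(0, 1)  # labels are random here, purely illustrative
        rows.append((label, features))
    return rows


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = make_dataset()
train, test = split_dataset(data)
print(len(train), len(test))  # 160 40
```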

Re: mllib performance on cluster

2014-09-02 Thread Bharath Mundlapudi
Those are interesting numbers. You haven't mentioned the dataset size in your thread. This is a classic example of scalability and performance, assuming your baseline numbers are correct and you tuned everything on your cluster correctly. Putting on my outsider cap, there are multiple reasons for this,

Re: mllib performance on cluster

2014-09-02 Thread SK
Num iterations: For LR and SVM, I am using the default value of 100. I am also using the default values for all the other parameters. I am pretty much reusing the code from BinaryClassification.scala. For Decision Tree, I don't see any parameter for the number of iterations in the example code, so I did

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
Also - what hardware are you running the cluster on? And what is the local machine hardware? On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks wrote: > How many iterations are you running? Can you provide the exact details > about the size of the dataset? (how many data points, how many features)

Re: mllib performance on cluster

2014-09-02 Thread Evan R. Sparks
How many iterations are you running? Can you provide the exact details about the size of the dataset? (how many data points, how many features) Is this sparse or dense - and for the sparse case, how many non-zeroes? How many partitions does your data RDD have? For very small datasets the scheduling overh