Hmm... something is fishy here. That's a *really* small dataset for a Spark job, so almost all your time will be spent in these overheads, but even so you should be able to train a logistic regression model with the default options and 100 iterations in <1s on a single machine. Are you caching your dataset before training the classifier on it? It's possible you're rereading it from disk (or over the network) on every iteration.
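If not, here's a minimal sketch of the pattern to check for (the file path and comma-separated label,features layout are illustrative, not from your setup):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Parse each line as "label,f1,f2,f3" into a LabeledPoint.
val data = sc.textFile("data.txt").map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache() // without cache(), every SGD iteration re-reads and re-parses the file
```

The key point is that `cache()` marks the RDD to be kept in memory after the first pass, so the 100 SGD iterations reuse it instead of repeating the I/O and parsing each time.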
From spark-shell:

import org.apache.spark.mllib.util.LogisticRegressionDataGenerator

val dat = LogisticRegressionDataGenerator.generateLogisticRDD(sc, 200, 3, 1e-4, 4, 0.2).cache()
println(dat.count()) // should give 200

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val start = System.currentTimeMillis
val model = LogisticRegressionWithSGD.train(dat, 100)
val delta = System.currentTimeMillis - start
println(delta) // On my laptop, 863ms.

On Tue, Sep 2, 2014 at 3:51 PM, SK <skrishna...@gmail.com> wrote:
> The dataset is quite small : 5.6 KB. It has 200 rows and 3 features, and 1
> column of labels. From this dataset, I split 80% for training set and 20%
> for test set. The features are integer counts and labels are binary (1/0).
>
> thanks